The assignment consists of preparing and presenting a report on a Data Science project, applying the techniques learned during the course to the selected dataset.
The group chose to work in the R language (RStudio version 1.4.1717) and to use GitHub as the version-control tool. The project “/practica_ml” was created by Isabela Ignacio (user IsaPires1329) and shared with the other members of the group, Luisa Yánez (user lyanezgu) and Miguel García (user mgarciasanc2021).
Link to the project on GitHub: https://github.com/IsaPires1329/practica_ml.git.
library(formatR)
library(readr)
library(ggplot2)
library(GGally)
library(dplyr)
library(tidyr)
library(missForest)
library(VIM)
library(formattable)
library(usmap)
library(cowplot)
library(corrplot)
library(MASS)
library(ggfortify)
library(nortest)
library(car)
library(lmtest)
library(PerformanceAnalytics)
library(Amelia)
library(ggthemes)
library(tidyverse)
library(tibble)
library(gridExtra)
library(ggbiplot)
library(factoextra)
library(caret)
library(ISLR)
library(rpart)
library(rpart.plot)
library(rattle)
library(tsne)
library(Rtsne)
library(class)
library(ada)
library(cluster)
library(useful)
library(mgcv)
library(xgboost)
library(randomForest)
library(kernlab)
library(pROC)
library(doMC)
library(ggpubr)
The dataset chosen by the group is called “Red Wine Quality”. It contains information on red variants of the Portuguese “Vinho Verde” wine, analysed and described through their physicochemical and sensory characteristics.
Link to the dataset: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009.
The “Red Wine Quality” dataset contains 12 columns and 1,599 rows and comes in .CSV format.
The data were first stored in a data frame called “red_wine”, and an initial inspection of its contents was carried out with the head and summary functions.
red_wine <- read_csv("winequality-red.csv")
head(red_wine)
## # A tibble: 6 × 12
## `fixed acidity` `volatile acidity` `citric acid` `residual sugar` chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## # … with 7 more variables: free sulfur dioxide <dbl>,
## # total sulfur dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
summary(red_wine)
## fixed acidity volatile acidity citric acid residual sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free sulfur dioxide total sulfur dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
Starting the initial analysis of the dataset, we see that the 12 variables that make up the data can be described as:
Input (predictor) variables, based on physicochemical tests: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol.
Output (response/target) variable, based on sensory data: quality (an integer score between 0 and 10).
The final goal of the project is to build a model that predicts the quality of the Portuguese “Vinho Verde” red wine and tells us whether a wine is recommendable (pass/drinkable) or not recommendable and best avoided (fail/undrinkable).
We decided to rename the variables that appear in the columns of the data, so that they follow a single naming pattern and, at the same time, to avoid spaces, which could cause problems later on.
names(red_wine) <- c("fixed_acidity", "volatile_acidity", "citric_acid",
"residual_sugar", "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide",
"density", "pH", "sulphates", "alcohol", "quality")
head(red_wine)
## # A tibble: 6 × 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## # … with 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
All the input variables available in the dataset are numeric, and in principle they require no type transformation.
One option would be to convert the output variable “quality” into a factor, making it categorical according to wine quality (integer scores between 0 and 10). The wine could then be categorized as “bad”, “average” and “good”, as “pass” or “fail”, or kept in the predefined 0–10 numeric categories, as sketched below.
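As an illustration, a minimal hedged sketch of two such categorizations (the names red_wine_cat, quality_3 and nota_vino_bin are hypothetical and not part of the analysis below):
red_wine_cat <- red_wine %>%
    mutate(quality_3 = cut(quality, breaks = c(-Inf, 4, 6, Inf),
        labels = c("malo", "normal", "bueno")),
        nota_vino_bin = factor(quality >= 6, labels = c("suspenso", "aprobado")))
table(red_wine_cat$quality_3)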
str(red_wine)
## spec_tbl_df [1,599 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile_acidity : num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric_acid : num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual_sugar : num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free_sulfur_dioxide : num [1:1599] 11 25 15 17 11 13 15 15 9 17 ...
## $ total_sulfur_dioxide: num [1:1599] 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num [1:1599] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : num [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
## - attr(*, "spec")=
## .. cols(
## .. `fixed acidity` = col_double(),
## .. `volatile acidity` = col_double(),
## .. `citric acid` = col_double(),
## .. `residual sugar` = col_double(),
## .. chlorides = col_double(),
## .. `free sulfur dioxide` = col_double(),
## .. `total sulfur dioxide` = col_double(),
## .. density = col_double(),
## .. pH = col_double(),
## .. sulphates = col_double(),
## .. alcohol = col_double(),
## .. quality = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Using the summary function we first verify that there are no missing values in the dataset. For that reason the group had to add them manually, to get closer to a more realistic scenario, where missing data are the norm and have to be dealt with.
The missing values were injected only into the columns that we believe will not be central to the main analysis of this study (pH and sulphates), so that the prediction we make remains as accurate as possible.
We use the missForest library and set a seed so that the result is always the same.
red_wine
## # A tibble: 1,599 × 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## 7 7.9 0.6 0.06 1.6 0.069
## 8 7.3 0.65 0 1.2 0.065
## 9 7.8 0.58 0.02 2 0.073
## 10 7.5 0.5 0.36 6.1 0.071
## # … with 1,589 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
set.seed(101)
red_wine <- bind_cols(red_wine[c(1, 2, 3, 4, 5, 6, 7, 8, 11,
12)], missForest::prodNA(red_wine[c(-1, -2, -3, -4, -5, -6,
-7, -8, -11, -12)], noNA = 0.1))
red_wine
## # A tibble: 1,599 × 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## 7 7.9 0.6 0.06 1.6 0.069
## 8 7.3 0.65 0 1.2 0.065
## 9 7.8 0.58 0.02 2 0.073
## 10 7.5 0.5 0.36 6.1 0.071
## # … with 1,589 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, alcohol <dbl>, quality <dbl>,
## # pH <dbl>, sulphates <dbl>
Using the VIM and Amelia libraries, we analyse the structure of the missing data within our dataset, to see and understand how they are distributed and which variables they affect.
We can verify that the proportion of missing data in these variables is approximately 10%, and that there are 15 rows in which both variables are missing.
summary(aggr(red_wine, numbers = T, sortVar = T))
##
## Variables sorted by number of missings:
## Variable Count
## pH 0.10318949
## sulphates 0.09631019
## fixed_acidity 0.00000000
## volatile_acidity 0.00000000
## citric_acid 0.00000000
## residual_sugar 0.00000000
## chlorides 0.00000000
## free_sulfur_dioxide 0.00000000
## total_sulfur_dioxide 0.00000000
## density 0.00000000
## alcohol 0.00000000
## quality 0.00000000
##
## Missings per variable:
## Variable Count
## fixed_acidity 0
## volatile_acidity 0
## citric_acid 0
## residual_sugar 0
## chlorides 0
## free_sulfur_dioxide 0
## total_sulfur_dioxide 0
## density 0
## alcohol 0
## quality 0
## pH 165
## sulphates 154
##
## Missings in combinations of variables:
## Combinations Count Percent
## 0:0:0:0:0:0:0:0:0:0:0:0 1295 80.9881176
## 0:0:0:0:0:0:0:0:0:0:0:1 139 8.6929331
## 0:0:0:0:0:0:0:0:0:0:1:0 150 9.3808630
## 0:0:0:0:0:0:0:0:0:0:1:1 15 0.9380863
missmap(red_wine, main = "Missing Map")
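The 15-row figure can also be checked directly; a minimal sketch:
sum(is.na(red_wine$pH) & is.na(red_wine$sulphates))  # rows where both variables are NA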
Having taken a first look at the general structure of the data and injected the missing values we needed, we split the dataset in two to separate the data we will use for training from the data we will use for testing (given the amount of data available, the chosen split was 20% test and 80% training). We set a seed that fixes the split permanently, so that the distribution of the data is always the same.
We also save the test partition to be used later for validating the final model, and from here on we work with the training partition.
set.seed(101)
sample <- sample.int(n = nrow(red_wine), size = floor(0.8 * nrow(red_wine)),
replace = F)
train <- red_wine[sample, ]
test <- red_wine[-sample, ]
train
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 2.8 0.068
## 2 7.6 0.49 0.33 1.9 0.074
## 3 5 1.02 0.04 1.4 0.045
## 4 7.6 0.43 0.29 2.1 0.075
## 5 6.8 0.59 0.1 1.7 0.063
## 6 6.8 0.815 0 1.2 0.267
## 7 8.5 0.21 0.52 1.9 0.09
## 8 7.4 0.36 0.29 2.6 0.087
## 9 5.5 0.49 0.03 1.8 0.044
## 10 6.8 0.49 0.22 2.3 0.071
## # … with 1,269 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, alcohol <dbl>, quality <dbl>,
## # pH <dbl>, sulphates <dbl>
test
## # A tibble: 320 × 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.3 0.65 0 1.2 0.065
## 3 8.9 0.22 0.48 1.8 0.077
## 4 7.6 0.41 0.24 1.8 0.08
## 5 7.1 0.71 0 1.9 0.08
## 6 5.7 1.13 0.09 1.5 0.172
## 7 7.3 0.45 0.36 5.9 0.074
## 8 8.1 0.66 0.22 2.2 0.069
## 9 6.8 0.67 0.02 1.8 0.05
## 10 5.6 0.31 0.37 1.4 0.074
## # … with 310 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, alcohol <dbl>, quality <dbl>,
## # pH <dbl>, sulphates <dbl>
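As an aside, since “quality” is quite imbalanced, a stratified split is sometimes preferred. A hedged sketch using caret (already loaded); train_strat and test_strat are hypothetical names, not the partitions used below:
set.seed(101)
idx <- caret::createDataPartition(red_wine$quality, p = 0.8, list = FALSE)
train_strat <- red_wine[idx, ]
test_strat <- red_wine[-idx, ]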
To impute the missing values in the “pH” and “sulphates” columns, we decided to replace all their NAs with the median values of those same variables.
With the summary function we verify that there are no missing values left in the train dataset.
train$pH[is.na(train$pH)] <- median(train$pH, na.rm = TRUE)
train$sulphates[is.na(train$sulphates)] <- median(train$sulphates,
na.rm = TRUE)
summary(train)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. : 0.900
## 1st Qu.: 7.100 1st Qu.:0.3900 1st Qu.:0.1000 1st Qu.: 1.900
## Median : 7.900 Median :0.5200 Median :0.2600 Median : 2.200
## Mean : 8.357 Mean :0.5262 Mean :0.2732 Mean : 2.552
## 3rd Qu.: 9.300 3rd Qu.:0.6300 3rd Qu.:0.4300 3rd Qu.: 2.600
## Max. :15.900 Max. :1.5800 Max. :0.7900 Max. :15.500
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.0120 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.0710 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.0800 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.0882 Mean :15.86 Mean : 46.44 Mean :0.9968
## 3rd Qu.:0.0910 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9979
## Max. :0.6110 Max. :68.00 Max. :289.00 Max. :1.0037
## alcohol quality pH sulphates
## Min. : 8.40 Min. :3.000 Min. :2.860 Min. :0.3300
## 1st Qu.: 9.50 1st Qu.:5.000 1st Qu.:3.220 1st Qu.:0.5600
## Median :10.20 Median :6.000 Median :3.300 Median :0.6200
## Mean :10.43 Mean :5.635 Mean :3.308 Mean :0.6526
## 3rd Qu.:11.10 3rd Qu.:6.000 3rd Qu.:3.380 3rd Qu.:0.7100
## Max. :14.90 Max. :8.000 Max. :4.010 Max. :1.9500
We analyse how the different variables of our dataset are distributed.
train %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value, fill = key)) + facet_wrap(~key, scales = "free") +
geom_histogram(bins = sqrt(nrow(train))) + theme(legend.position = "none")
From the plots we can see that some of the variables are approximately normally distributed, while others are right-skewed.
The distributions of “fixed_acidity” and “volatile_acidity” are very similar, which suggests certain similarities between these two physicochemical indicators.
The variables “density” and “pH” are normally distributed, which indicates that the red wines differ little in these two indicators; no transformation of their distributions is therefore required.
The variables “residual_sugar”, “chlorides”, “free_sulfur_dioxide”, “total_sulfur_dioxide” and “sulphates” are heavily skewed, so it is advisable to transform them so that the distribution of their values becomes more homogeneous. This is achieved by applying a logarithmic transformation, which normalizes their distributions:
train <- train %>%
mutate(Log_residual_sugar = log(residual_sugar), Log_chlorides = log(chlorides),
Log_free_sulfur_dioxide = log(free_sulfur_dioxide), Log_total_sulfur_dioxide = log(total_sulfur_dioxide),
Log_sulphates = log(sulphates))
ga <- train %>%
ggplot(aes(x = Log_residual_sugar)) + geom_histogram(bins = 20,
fill = "#619CFF")
gb <- train %>%
ggplot(aes(x = Log_chlorides)) + geom_histogram(bins = 20,
fill = "#E58700")
gc <- train %>%
ggplot(aes(x = Log_free_sulfur_dioxide)) + geom_histogram(bins = 20,
fill = "#00BF7D")
gd <- train %>%
ggplot(aes(x = Log_total_sulfur_dioxide)) + geom_histogram(bins = 20,
fill = "#FD61D1")
ge <- train %>%
ggplot(aes(x = Log_sulphates)) + geom_histogram(bins = 20,
fill = "#B983FF")
grid.arrange(ga, gb, gc, gd, ge)
We modify our original dataset so that the log-transformed variables replace their untransformed counterparts. We thus still have a dataset with 12 variables, but with 5 of them log-transformed.
train <- train %>%
dplyr::select(-residual_sugar, -chlorides, -free_sulfur_dioxide,
-total_sulfur_dioxide, -sulphates)
train %>%
summary
## fixed_acidity volatile_acidity citric_acid density
## Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. :0.9901
## 1st Qu.: 7.100 1st Qu.:0.3900 1st Qu.:0.1000 1st Qu.:0.9956
## Median : 7.900 Median :0.5200 Median :0.2600 Median :0.9968
## Mean : 8.357 Mean :0.5262 Mean :0.2732 Mean :0.9968
## 3rd Qu.: 9.300 3rd Qu.:0.6300 3rd Qu.:0.4300 3rd Qu.:0.9979
## Max. :15.900 Max. :1.5800 Max. :0.7900 Max. :1.0037
## alcohol quality pH Log_residual_sugar
## Min. : 8.40 Min. :3.000 Min. :2.860 Min. :-0.1054
## 1st Qu.: 9.50 1st Qu.:5.000 1st Qu.:3.220 1st Qu.: 0.6419
## Median :10.20 Median :6.000 Median :3.300 Median : 0.7885
## Mean :10.43 Mean :5.635 Mean :3.308 Mean : 0.8554
## 3rd Qu.:11.10 3rd Qu.:6.000 3rd Qu.:3.380 3rd Qu.: 0.9555
## Max. :14.90 Max. :8.000 Max. :4.010 Max. : 2.7408
## Log_chlorides Log_free_sulfur_dioxide Log_total_sulfur_dioxide
## Min. :-4.4228 Min. :0.000 Min. :1.792
## 1st Qu.:-2.6451 1st Qu.:1.946 1st Qu.:3.091
## Median :-2.5257 Median :2.639 Median :3.638
## Mean :-2.4980 Mean :2.545 Mean :3.604
## 3rd Qu.:-2.3969 3rd Qu.:3.045 3rd Qu.:4.127
## Max. :-0.4927 Max. :4.220 Max. :5.666
## Log_sulphates
## Min. :-1.1087
## 1st Qu.:-0.5798
## Median :-0.4780
## Mean :-0.4496
## 3rd Qu.:-0.3425
## Max. : 0.6678
Having applied the appropriate logarithmic transformations to the 5 variables that required them, we take another overall look at the distributions of the full set of variables:
train %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value, fill = key)) + facet_wrap(~key, scales = "free") +
geom_histogram(bins = sqrt(nrow(train))) + theme(legend.position = "none")
We analyse in detail the distribution of the output variable “quality”, the 0–10 quality score of the wines.
ggplot(data = train) + geom_bar(mapping = aes(x = quality, fill = as.factor(quality))) +
labs(title = "Histograma Calidad Vino")
table(train$quality)
##
## 3 4 5 6 7 8
## 7 38 552 513 156 13
prop.table(table(train$quality))
##
## 3 4 5 6 7 8
## 0.005473026 0.029710711 0.431587177 0.401094605 0.121970289 0.010164191
From the plot and the table we can see that most of the wines (about 83% of them) have quality scores of 5 or 6, on a scale that goes from 0 to 10.
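This figure can be checked directly; a minimal sketch:
sum(prop.table(table(train$quality))[c("5", "6")])  # ~0.83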
We use boxplots to check whether our variables have outliers and to compare their central values and spread across quality scores.
Boxplot of the variable alcohol
BoxPlot_alcohol <- ggplot(train, aes(x = factor(quality), y = alcohol)) +
    geom_boxplot(fill = "#F8766D") + ggtitle("Boxplot alcohol")
BoxPlot_alcohol
We can see that wines with higher “quality” scores generally have a higher alcohol percentage.
Boxplot of the variable citric_acid
BoxPlot_citric_acid <- ggplot(train, aes(x = factor(quality),
    y = citric_acid)) + geom_boxplot(fill = "#E58700") +
    ggtitle("Boxplot citric_acid")
BoxPlot_citric_acid
We can see that wines with higher “quality” scores generally contain more citric acid.
Boxplot of the variable density
BoxPlot_density <- ggplot(train, aes(x = factor(quality), y = density)) +
    geom_boxplot(fill = "#C99800") + ggtitle("Boxplot density")
BoxPlot_density
We can see that wines with higher “quality” scores generally have a slightly lower density, but it is not a decisive variable for product quality.
Boxplot of the variable fixed_acidity
BoxPlot_fixed_acidity <- ggplot(train, aes(x = factor(quality),
    y = fixed_acidity)) + geom_boxplot(fill = "#6BB100") +
    ggtitle("Boxplot fixed_acidity")
BoxPlot_fixed_acidity
We can see that “fixed_acidity” remains fairly stable regardless of the final quality of the wine, with no large differences across quality ranges.
Boxplot of the variable Log_chlorides
BoxPlot_Log_chlorides <- ggplot(train, aes(x = factor(quality),
    y = Log_chlorides)) + geom_boxplot(fill = "#00BA38") +
    ggtitle("Boxplot Log_chlorides")
BoxPlot_Log_chlorides
We can see that “Log_chlorides” remains fairly stable regardless of the final quality of the wine, with no large differences across quality ranges.
Boxplot of the variable Log_free_sulfur_dioxide
BoxPlot_Log_free_sulfur_dioxide <- ggplot(train, aes(x = factor(quality),
    y = Log_free_sulfur_dioxide)) + geom_boxplot(fill = "#00BF7D") +
    ggtitle("Boxplot Log_free_sulfur_dioxide")
BoxPlot_Log_free_sulfur_dioxide
No specific trend is apparent in “Log_free_sulfur_dioxide” that is decisive for wine quality.
Boxplot of the variable Log_residual_sugar
BoxPlot_Log_residual_sugar <- ggplot(train, aes(x = factor(quality),
    y = Log_residual_sugar)) + geom_boxplot(fill = "#00C0AF") +
    ggtitle("Boxplot Log_residual_sugar")
BoxPlot_Log_residual_sugar
We can see that “Log_residual_sugar” remains fairly stable regardless of the final quality of the wine, with no large differences across quality ranges.
Boxplot of the variable Log_sulphates
BoxPlot_Log_sulphates <- ggplot(train, aes(x = factor(quality),
    y = Log_sulphates)) + geom_boxplot(fill = "#00BCD8") +
    ggtitle("Boxplot Log_sulphates")
BoxPlot_Log_sulphates
We can see that wines with higher “quality” scores generally have higher “Log_sulphates” values, although there are quite a few outliers at scores of 5 and 6, which could be misleading.
Boxplot of the variable Log_total_sulfur_dioxide
BoxPlot_Log_total_sulfur_dioxide <- ggplot(train, aes(x = factor(quality),
    y = Log_total_sulfur_dioxide)) + geom_boxplot(fill = "#00B0F6") +
    ggtitle("Boxplot Log_total_sulfur_dioxide")
BoxPlot_Log_total_sulfur_dioxide
No specific trend is apparent in “Log_total_sulfur_dioxide” that is decisive for wine quality.
Boxplot of the variable pH
BoxPlot_pH <- ggplot(train, aes(x = factor(quality), y = pH)) +
    geom_boxplot(fill = "#B983FF") + ggtitle("Boxplot pH")
BoxPlot_pH
We can see that wines with higher “quality” scores generally have a slightly lower pH, although there are numerous outliers among wines scored 5 and 6, which could be misleading.
Boxplot of the variable volatile_acidity
BoxPlot_volatile_acidity <- ggplot(train, aes(x = factor(quality),
    y = volatile_acidity)) + geom_boxplot(fill = "#FF67A4") +
    ggtitle("Boxplot volatile_acidity")
BoxPlot_volatile_acidity
We can see that wines with higher “quality” scores generally have lower “volatile_acidity”.
Continuing with the analysis of the dataset's variables and the study of how they relate to one another, we want a global view of how the numeric variables that may be useful for the prediction model correlate.
pairs(train)
corrplot(cor(train %>%
mutate(quality = as.numeric(quality)) %>%
keep(is.numeric)))
res <- cor(train %>%
mutate(quality = as.numeric(quality)) %>%
keep(is.numeric))
round(res, 2)
## fixed_acidity volatile_acidity citric_acid density
## fixed_acidity 1.00 -0.26 0.68 0.68
## volatile_acidity -0.26 1.00 -0.55 0.02
## citric_acid 0.68 -0.55 1.00 0.37
## density 0.68 0.02 0.37 1.00
## alcohol -0.05 -0.21 0.15 -0.49
## quality 0.14 -0.39 0.25 -0.16
## pH -0.64 0.22 -0.49 -0.32
## Log_residual_sugar 0.20 0.04 0.19 0.44
## Log_chlorides 0.16 0.09 0.16 0.33
## Log_free_sulfur_dioxide -0.18 0.03 -0.11 -0.04
## Log_total_sulfur_dioxide -0.12 0.08 -0.03 0.11
## Log_sulphates 0.19 -0.30 0.32 0.14
## alcohol quality pH Log_residual_sugar Log_chlorides
## fixed_acidity -0.05 0.14 -0.64 0.20 0.16
## volatile_acidity -0.21 -0.39 0.22 0.04 0.09
## citric_acid 0.15 0.25 -0.49 0.19 0.16
## density -0.49 -0.16 -0.32 0.44 0.33
## alcohol 1.00 0.49 0.18 0.06 -0.29
## quality 0.49 1.00 -0.07 0.03 -0.16
## pH 0.18 -0.07 1.00 -0.10 -0.26
## Log_residual_sugar 0.06 0.03 -0.10 1.00 0.12
## Log_chlorides -0.29 -0.16 -0.26 0.12 1.00
## Log_free_sulfur_dioxide -0.09 -0.03 0.08 0.10 -0.02
## Log_total_sulfur_dioxide -0.24 -0.16 -0.03 0.17 0.06
## Log_sulphates 0.13 0.33 -0.14 0.02 0.22
## Log_free_sulfur_dioxide Log_total_sulfur_dioxide
## fixed_acidity -0.18 -0.12
## volatile_acidity 0.03 0.08
## citric_acid -0.11 -0.03
## density -0.04 0.11
## alcohol -0.09 -0.24
## quality -0.03 -0.16
## pH 0.08 -0.03
## Log_residual_sugar 0.10 0.17
## Log_chlorides -0.02 0.06
## Log_free_sulfur_dioxide 1.00 0.79
## Log_total_sulfur_dioxide 0.79 1.00
## Log_sulphates 0.06 0.04
## Log_sulphates
## fixed_acidity 0.19
## volatile_acidity -0.30
## citric_acid 0.32
## density 0.14
## alcohol 0.13
## quality 0.33
## pH -0.14
## Log_residual_sugar 0.02
## Log_chlorides 0.22
## Log_free_sulfur_dioxide 0.06
## Log_total_sulfur_dioxide 0.04
## Log_sulphates 1.00
We see that the variables most correlated with “quality” are: “volatile_acidity”, “citric_acid”, “alcohol” and “Log_sulphates”.
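These values can be read directly from the correlation matrix computed above; a minimal sketch ranking the correlations with “quality”:
sort(res[, "quality"], decreasing = TRUE)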
We perform a bivariate analysis to see which pairs of variables are most correlated, positively or negatively.
Correlation: fixed_acidity and citric_acid:
cor(x = train$fixed_acidity, y = train$citric_acid)
## [1] 0.678372
train %>%
ggplot(aes(fixed_acidity, citric_acid)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables fixed_acidity y citric_acid",
x = "fixed_acidity", y = "citric_acid")
Correlation: fixed_acidity and density:
cor(x = train$fixed_acidity, y = train$density)
## [1] 0.6782196
train %>%
ggplot(aes(fixed_acidity, density)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables fixed_acidity y density",
x = "fixed_acidity", y = "density")
Correlation: fixed_acidity and pH:
cor(x = train$fixed_acidity, y = train$pH)
## [1] -0.644656
train %>%
ggplot(aes(fixed_acidity, pH)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables fixed_acidity y pH",
x = "fixed_acidity", y = "pH")
Correlation: citric_acid and volatile_acidity:
cor(x = train$citric_acid, y = train$volatile_acidity)
## [1] -0.5538307
train %>%
ggplot(aes(citric_acid, volatile_acidity)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables citric_acid y volatile_acidity",
x = "citric_acid", y = "volatile_acidity")
Correlation: citric_acid and pH:
cor(x = train$citric_acid, y = train$pH)
## [1] -0.4941459
train %>%
ggplot(aes(citric_acid, pH)) + geom_point(alpha = 0.2, colour = "green") +
geom_smooth(formula = "y ~ x", method = "lm") + labs(title = "Relación entre variables citric_acid y pH",
x = "citric_acid", y = "pH")
Correlation: density and Log_residual_sugar:
cor(x = train$density, y = train$Log_residual_sugar)
## [1] 0.4399375
train %>%
ggplot(aes(density, Log_residual_sugar)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables density y Log_residual_sugar",
x = "density", y = "Log_residual_sugar")
Correlation: density and alcohol:
cor(x = train$density, y = train$alcohol)
## [1] -0.4880924
train %>%
ggplot(aes(density, alcohol)) + geom_point(alpha = 0.2, colour = "green") +
geom_smooth(formula = "y ~ x", method = "lm") + labs(title = "Relación entre variables density y alcohol",
x = "density", y = "alcohol")
Correlation: quality and alcohol:
cor(x = train$quality, y = train$alcohol)
## [1] 0.4895963
train %>%
    ggplot(aes(quality, alcohol)) + geom_point(alpha = 0.2,
    colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
    labs(title = "Relación entre variables quality y alcohol",
        x = "quality", y = "alcohol")
Correlation: quality and volatile_acidity:
cor(x = train$quality, y = train$volatile_acidity)
## [1] -0.3904367
train %>%
ggplot(aes(quality, volatile_acidity)) + geom_point(alpha = 0.2,
colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
labs(title = "Relación entre variables quality y volatile_acidity",
x = "quality", y = "volatile_acidity")
Correlation: Log_free_sulfur_dioxide and Log_total_sulfur_dioxide:
cor(x = train$Log_free_sulfur_dioxide, y = train$Log_total_sulfur_dioxide)
## [1] 0.7856495
train %>%
ggplot(aes(Log_free_sulfur_dioxide, Log_total_sulfur_dioxide)) +
geom_point(alpha = 0.2, colour = "green") + geom_smooth(formula = "y ~ x",
method = "lm") + labs(title = "Relación entre variables Log_free_sulfur_dioxide y Log_total_sulfur_dioxide",
x = "Log_free_sulfur_dioxide", y = "Log_total_sulfur_dioxide")
We now apply to the test dataset the changes and transformations previously applied to our train dataset.
We impute the NAs of the test dataset:
test$pH[is.na(test$pH)] <- median(test$pH, na.rm = TRUE)
test$sulphates[is.na(test$sulphates)] <- median(test$sulphates,
na.rm = TRUE)
summary(test)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 4.70 Min. :0.1600 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.080 1st Qu.: 1.900
## Median : 7.80 Median :0.5200 Median :0.250 Median : 2.150
## Mean : 8.17 Mean :0.5341 Mean :0.262 Mean : 2.486
## 3rd Qu.: 9.00 3rd Qu.:0.6600 3rd Qu.:0.420 3rd Qu.: 2.525
## Max. :15.00 Max. :1.2400 Max. :1.000 Max. :13.800
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.01200 Min. : 3.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.06800 1st Qu.: 7.00 1st Qu.: 20.00 1st Qu.:0.9957
## Median :0.07800 Median :14.00 Median : 37.00 Median :0.9967
## Mean :0.08452 Mean :15.95 Mean : 46.58 Mean :0.9967
## 3rd Qu.:0.08725 3rd Qu.:22.00 3rd Qu.: 63.25 3rd Qu.:0.9977
## Max. :0.61000 Max. :72.00 Max. :160.00 Max. :1.0024
## alcohol quality pH sulphates
## Min. : 8.80 Min. :3.000 Min. :2.740 Min. :0.3900
## 1st Qu.: 9.50 1st Qu.:5.000 1st Qu.:3.210 1st Qu.:0.5500
## Median :10.10 Median :6.000 Median :3.320 Median :0.6200
## Mean :10.39 Mean :5.641 Mean :3.315 Mean :0.6583
## 3rd Qu.:11.00 3rd Qu.:6.000 3rd Qu.:3.400 3rd Qu.:0.7100
## Max. :14.00 Max. :8.000 Max. :3.850 Max. :2.0000
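Note that the NAs above are filled with medians computed on the test partition itself. A common alternative, to avoid any information flowing between partitions, is to reuse the medians learned on train; a hedged sketch of that variant (not what was run above):
test$pH[is.na(test$pH)] <- median(train$pH, na.rm = TRUE)
test$sulphates[is.na(test$sulphates)] <- median(train$sulphates, na.rm = TRUE)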
We log-transform in the test set the same variables that were transformed in train:
test <- test %>%
mutate(Log_residual_sugar = log(residual_sugar), Log_chlorides = log(chlorides),
Log_free_sulfur_dioxide = log(free_sulfur_dioxide), Log_total_sulfur_dioxide = log(total_sulfur_dioxide),
Log_sulphates = log(sulphates))
We modify our test dataset so that the log-transformed variables replace their untransformed counterparts. We thus still have a dataset with 12 variables, but with 5 of them log-transformed.
test <- test %>%
dplyr::select(-residual_sugar, -chlorides, -free_sulfur_dioxide,
-total_sulfur_dioxide, -sulphates)
head(test)
## # A tibble: 6 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol quality pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 0.998 9.4 5 3.51
## 2 7.3 0.65 0 0.995 10 7 3.39
## 3 8.9 0.22 0.48 0.997 9.4 6 3.39
## 4 7.6 0.41 0.24 0.996 9.5 5 3.28
## 5 7.1 0.71 0 0.997 9.4 5 3.47
## 6 5.7 1.13 0.09 0.994 9.8 4 3.5
## # … with 5 more variables: Log_residual_sugar <dbl>, Log_chlorides <dbl>,
## # Log_free_sulfur_dioxide <dbl>, Log_total_sulfur_dioxide <dbl>,
## # Log_sulphates <dbl>
Having analysed our dataset in depth and having understood and transformed our variables, we will try to fit a multiple linear regression model that predicts the quality of the Portuguese “Vinho Verde” red wine.
We fit a multiple linear regression model to predict the value of the quality variable from the following independent variables, selected on the basis of the correlation analyses and studies seen above (we take all variables except “Log_residual_sugar”, which shows no correlation with the “quality” variable).
modelo = lm(quality ~ alcohol + fixed_acidity + volatile_acidity +
citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
Log_free_sulfur_dioxide + density + pH + Log_sulphates, data = train)
summary(modelo)
##
## Call:
## lm(formula = quality ~ alcohol + fixed_acidity + volatile_acidity +
## citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
## Log_free_sulfur_dioxide + density + pH + Log_sulphates, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.27885 -0.34712 -0.05254 0.43254 1.89518
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.94814 18.67185 -0.533 0.594274
## alcohol 0.30374 0.02413 12.587 < 2e-16 ***
## fixed_acidity 0.01981 0.02396 0.827 0.408630
## volatile_acidity -1.04091 0.12864 -8.092 1.37e-15 ***
## citric_acid -0.31703 0.15340 -2.067 0.038963 *
## Log_chlorides -0.22535 0.06205 -3.632 0.000293 ***
## Log_total_sulfur_dioxide -0.15705 0.04408 -3.563 0.000380 ***
## Log_free_sulfur_dioxide 0.12800 0.04305 2.973 0.003001 **
## density 14.65001 19.00095 0.771 0.440842
## pH -0.50069 0.18785 -2.665 0.007788 **
## Log_sulphates 0.85751 0.09545 8.984 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6241 on 1268 degrees of freedom
## Multiple R-squared: 0.3862, Adjusted R-squared: 0.3813
## F-statistic: 79.77 on 10 and 1268 DF, p-value: < 2.2e-16
For variable selection we use automatic stepwise selection.
empty.model <- lm(quality ~ 1, data = train)
horizonte <- formula(quality ~ alcohol + fixed_acidity + volatile_acidity +
citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
Log_free_sulfur_dioxide + density + pH + Log_sulphates)
# stepwise selection method; indicates which
# variables are significant
seleccion = stepAIC(empty.model, direction = c("both"), trace = FALSE,
scope = horizonte)
seleccion$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## quality ~ 1
##
## Final Model:
## quality ~ alcohol + volatile_acidity + Log_sulphates + Log_chlorides +
## pH + Log_total_sulfur_dioxide + Log_free_sulfur_dioxide
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 1278 804.4848 -590.9851
## 2 + alcohol 1 192.838647 1277 611.6461 -939.4926
## 3 + volatile_acidity 1 69.859784 1276 541.7863 -1092.6125
## 4 + Log_sulphates 1 29.717001 1275 512.0693 -1162.7631
## 5 + Log_chlorides 1 4.327006 1274 507.7423 -1171.6166
## 6 + pH 1 5.098410 1273 502.6439 -1182.5244
## 7 + Log_total_sulfur_dioxide 1 2.458061 1272 500.1858 -1186.7944
## 8 + Log_free_sulfur_dioxide 1 3.993776 1271 496.1921 -1195.0476
We inspect the information of the model chosen as the “best”:
summary(seleccion)
##
## Call:
## lm(formula = quality ~ alcohol + volatile_acidity + Log_sulphates +
## Log_chlorides + pH + Log_total_sulfur_dioxide + Log_free_sulfur_dioxide,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.24223 -0.35766 -0.05925 0.43097 1.88984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.97311 0.44122 11.271 < 2e-16 ***
## alcohol 0.28289 0.01841 15.366 < 2e-16 ***
## volatile_acidity -0.90819 0.10940 -8.301 2.60e-16 ***
## Log_sulphates 0.86434 0.09387 9.208 < 2e-16 ***
## Log_chlorides -0.23817 0.05970 -3.989 7.00e-05 ***
## pH -0.52666 0.13211 -3.987 7.08e-05 ***
## Log_total_sulfur_dioxide -0.17181 0.04226 -4.065 5.10e-05 ***
## Log_free_sulfur_dioxide 0.13511 0.04224 3.198 0.00142 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6248 on 1271 degrees of freedom
## Multiple R-squared: 0.3832, Adjusted R-squared: 0.3798
## F-statistic: 112.8 on 7 and 1271 DF, p-value: < 2.2e-16
We keep the model selected as the best for the regression. (Note that, compared with the stepwise selection above, this specification adds citric_acid and fixed_acidity and drops Log_free_sulfur_dioxide.)
mejor_modelo = lm(quality ~ alcohol + volatile_acidity + Log_sulphates +
Log_chlorides + pH + Log_total_sulfur_dioxide + citric_acid +
fixed_acidity, data = train)
summary(mejor_modelo)
##
## Call:
## lm(formula = quality ~ alcohol + volatile_acidity + Log_sulphates +
## Log_chlorides + pH + Log_total_sulfur_dioxide + citric_acid +
## fixed_acidity, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.31169 -0.35282 -0.05415 0.42911 1.86472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.27164 0.63721 6.704 3.05e-11 ***
## alcohol 0.30000 0.01884 15.921 < 2e-16 ***
## volatile_acidity -1.07446 0.12691 -8.466 < 2e-16 ***
## Log_sulphates 0.88622 0.09458 9.370 < 2e-16 ***
## Log_chlorides -0.21728 0.06080 -3.574 0.000365 ***
## pH -0.40708 0.16560 -2.458 0.014097 *
## Log_total_sulfur_dioxide -0.05076 0.02651 -1.915 0.055768 .
## citric_acid -0.38575 0.15186 -2.540 0.011197 *
## fixed_acidity 0.03469 0.01610 2.155 0.031357 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6258 on 1270 degrees of freedom
## Multiple R-squared: 0.3818, Adjusted R-squared: 0.3779
## F-statistic: 98.03 on 8 and 1270 DF, p-value: < 2.2e-16
We determine the confidence intervals for the fitted values of our observations.
intervalos = predict(mejor_modelo, interval = "confidence", level = 0.95)
head(intervalos)
## fit lwr upr
## 1 5.545952 5.455099 5.636806
## 2 5.095579 5.013600 5.177558
## 3 4.981789 4.813456 5.150122
## 4 5.381935 5.312809 5.451061
## 5 5.445889 5.371787 5.519992
## 6 4.747729 4.589377 4.906082
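For individual future observations, prediction intervals (wider than confidence intervals, since they also include the residual variance) are obtained analogously; a minimal sketch, where intervalos_pred is a hypothetical name:
intervalos_pred <- predict(mejor_modelo, interval = "prediction", level = 0.95)
head(intervalos_pred)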
The ANOVA table shows us the significance of the regression:
anova = aov(mejor_modelo)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## alcohol 1 192.8 192.84 492.415 < 2e-16 ***
## volatile_acidity 1 69.9 69.86 178.387 < 2e-16 ***
## Log_sulphates 1 29.7 29.72 75.883 < 2e-16 ***
## Log_chlorides 1 4.3 4.33 11.049 0.000913 ***
## pH 1 5.1 5.10 13.019 0.000320 ***
## Log_total_sulfur_dioxide 1 2.5 2.46 6.277 0.012358 *
## citric_acid 1 1.0 1.01 2.584 0.108223
## fixed_acidity 1 1.8 1.82 4.644 0.031357 *
## Residuals 1270 497.4 0.39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
vif(mejor_modelo)
## alcohol volatile_acidity Log_sulphates
## 1.326586 1.667617 1.236976
## Log_chlorides pH Log_total_sulfur_dioxide
## 1.301107 1.887347 1.125057
## citric_acid fixed_acidity
## 2.892183 2.660426
mean(vif(mejor_modelo))
## [1] 1.762162
Generally, a VIF above 4 (or a tolerance below 0.25) indicates that multicollinearity may exist (strong correlation among the model's explanatory variables) and calls for further investigation. When the VIF exceeds 10 (or the tolerance is below 0.1), there is significant multicollinearity that must be corrected. In this case no multicollinearity is observed.
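As a reminder of where these values come from, VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing predictor j on the remaining predictors. A hedged sketch for “alcohol” (r2_alcohol is a hypothetical name):
r2_alcohol <- summary(lm(alcohol ~ volatile_acidity + Log_sulphates +
    Log_chlorides + pH + Log_total_sulfur_dioxide + citric_acid +
    fixed_acidity, data = train))$r.squared
1/(1 - r2_alcohol)  # should reproduce vif(mejor_modelo)["alcohol"], ~1.33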
mean(mejor_modelo$residuals)
## [1] -1.419695e-17
# version 1 of the plot
plot(mejor_modelo, 1)
# version 2, which shows the same thing
autoplot(mejor_modelo, 1)
In the Residuals vs. Fitted plot we observe that the mean of the residuals is close to zero (although not uniformly so), so in principle the linearity of the model is not violated. However, with a discrete dependent variable such as “quality”, an ordinary linear regression model does not fit our data perfectly.
We first check the normality of the residuals. Since the Shapiro-Wilk test only allows the first 5000 samples of the residuals, we also use the Anderson-Darling test to compare results.
Shapiro-Wilk:
# muestras_residuos = resid(mejor_modelo)
# obtain the studentized residuals
muestras_residuos1 = studres(mejor_modelo)
residual_norm = shapiro.test(muestras_residuos1[0:5000])
residual_norm
##
## Shapiro-Wilk normality test
##
## data: muestras_residuos1[0:5000]
## W = 0.98969, p-value = 7.913e-08
Anderson-Darling:
# install.packages('nortest')
residual_anderson = ad.test(muestras_residuos1)
residual_anderson
##
## Anderson-Darling normality test
##
## data: muestras_residuos1
## A = 4.0651, p-value = 4.078e-10
This normality assumption for the residuals can also be checked graphically: as the plot shows, our data move away from the reference line in the tails, which may indicate that the residuals do not follow a normal distribution.
# These three plots show the same thing
plot(mejor_modelo, 2)
autoplot(mejor_modelo, 2)
hist(muestras_residuos1, freq = FALSE, main = "Distribución de los residuos estandarizados")
xfit <- seq(min(muestras_residuos1), max(muestras_residuos1),
length = 40)
yfit <- dnorm(xfit)
lines(xfit, yfit)
With the Q-Q plot we see that, apart from the tails, the residuals approximately follow a normal distribution. It can therefore still be assumed, with caution, that the coefficient estimators are normally distributed.
We now check homoscedasticity (that the residuals have constant variance).
As the results show, p_value < 0.05, so the null hypothesis is rejected, indicating that the variance is not constant for this linear regression model (there is heteroscedasticity, which is a problem). We can conclude that this mathematical model is not adequate.
# https://fhernanb.github.io/libro_regresion/homo.html
# another test to check homoscedasticity
ncvTest(mejor_modelo)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 17.5856, Df = 1, p = 2.7466e-05
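Another standard check is the Breusch-Pagan test from the lmtest package (already loaded); a minimal sketch:
bptest(mejor_modelo)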
Homoscedasticity can also be checked graphically; ideally, the red line should be as straight/horizontal as possible.
plot(mejor_modelo, 3)
autoplot(mejor_modelo, 3)
As the results show, p_value > 0.05, so we fail to reject the null hypothesis of independence.
dwtest(mejor_modelo)
##
## Durbin-Watson test
##
## data: mejor_modelo
## DW = 2.0088, p-value = 0.563
## alternative hypothesis: true autocorrelation is greater than 0
The independence of the residuals can also be checked graphically: as can be seen, no unusual patterns appear, which suggests that the residuals are independent and show no autocorrelation.
plot(mejor_modelo$resid)
acf(mejor_modelo$residuals)
For the principal component analysis we take all the variables of our dataset except “quality”, which is the one we are trying to predict.
prcomp_train <- prcomp(train[, -6])
prcomp_train
## Standard deviations (1, .., p=11):
## [1] 1.7932919537 1.1139049116 0.8715249896 0.3582166080 0.3099346034
## [6] 0.3016065686 0.2104320866 0.1621592924 0.1042111183 0.0987024931
## [11] 0.0007134006
##
## Rotation (n x k) = (11 x 11):
## PC1 PC2 PC3 PC4
## fixed_acidity 0.987707565 -0.0103012861 -1.022231e-01 0.062459844
## volatile_acidity -0.025573618 -0.0335218329 2.264482e-02 -0.079018774
## citric_acid 0.074203260 0.0250265448 -3.742426e-02 -0.046522512
## density 0.000724006 -0.0007910339 8.763104e-05 -0.001630274
## alcohol -0.033238009 0.9069893906 -3.959749e-01 -0.033337688
## pH -0.052371326 0.0213916196 4.913437e-03 0.011180103
## Log_residual_sugar 0.039855973 -0.0021044752 -1.037845e-01 -0.826895669
## Log_chlorides 0.031702071 -0.0848605722 3.174456e-02 -0.391082136
## Log_free_sulfur_dioxide -0.086908780 -0.2361937132 -6.575105e-01 0.323215453
## Log_total_sulfur_dioxide -0.061636505 -0.3343377688 -6.206551e-01 -0.211995173
## Log_sulphates 0.021906011 0.0167601084 -3.879522e-02 -0.018426525
## PC5 PC6 PC7 PC8
## fixed_acidity -0.0230198704 -0.0059651163 -0.064939449 -0.039228108
## volatile_acidity -0.0390618557 -0.0303160677 -0.597211835 -0.541189128
## citric_acid 0.0729849365 0.0909937418 0.381491021 0.409952831
## density -0.0002677616 -0.0007711058 0.000185434 -0.001576883
## alcohol 0.0565568851 0.0881221105 -0.082562820 -0.001514142
## pH -0.0569108452 -0.0518242796 -0.038186548 -0.140386718
## Log_residual_sugar -0.2808810688 -0.4555153363 0.125710162 -0.007655261
## Log_chlorides 0.8611890499 0.1118830637 -0.226520421 0.140520863
## Log_free_sulfur_dioxide 0.2215346137 -0.5890496504 -0.021993993 0.034613521
## Log_total_sulfur_dioxide -0.1995578676 0.6396298085 -0.031382301 -0.022315853
## Log_sulphates 0.2770165639 0.0647750830 0.645544341 -0.704485728
## PC9 PC10 PC11
## fixed_acidity 0.041080555 -0.0443511620 0.0008574666
## volatile_acidity -0.081317870 0.5769403933 0.0004106125
## citric_acid 0.118259173 0.8056760674 -0.0002945593
## density 0.004180642 -0.0002855141 -0.9999877190
## alcohol -0.015362480 -0.0183878205 -0.0008768774
## pH 0.983885416 -0.0388597280 0.0043212789
## Log_residual_sugar -0.025609909 -0.0311525353 0.0017331842
## Log_chlorides 0.071354850 -0.0867580300 0.0004730618
## Log_free_sulfur_dioxide -0.012248313 0.0445217874 -0.0001883117
## Log_total_sulfur_dioxide 0.025005306 -0.0566273254 0.0002213595
## Log_sulphates -0.053924917 0.0228971577 0.0009037505
The standard deviations are the square roots of the eigenvalues of the covariance matrix, and they represent the variability along each component: the larger the value, the more relevant the corresponding component is for visualization purposes. If we want to visualize the relative importance of each component, we do the following:
plot(prcomp_train)
Numerically:
summary(prcomp_train)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7933 1.1139 0.8715 0.35822 0.30993 0.30161 0.21043
## Proportion of Variance 0.5719 0.2207 0.1351 0.02282 0.01708 0.01618 0.00788
## Cumulative Proportion 0.5719 0.7926 0.9277 0.95052 0.96761 0.98378 0.99166
## PC8 PC9 PC10 PC11
## Standard deviation 0.16216 0.10421 0.09870 0.0007134
## Proportion of Variance 0.00468 0.00193 0.00173 0.0000000
## Cumulative Proportion 0.99634 0.99827 1.00000 1.0000000
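As a sanity check, the squared standard deviations of the components coincide with the eigenvalues of the covariance matrix of the (unscaled) data; a hedged sketch:
all.equal(eigen(cov(as.matrix(train[, -6])))$values, prcomp_train$sdev^2)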
To prevent a variable from being more relevant and more influential simply because it has a larger magnitude, the data must be standardized:
prcomp_train <- prcomp(train[, -6], center = TRUE, scale. = TRUE)
prcomp_train
## Standard deviations (1, .., p=11):
## [1] 1.7629487 1.4397808 1.2644531 1.0430903 0.9814598 0.8259786 0.7584904
## [8] 0.6527267 0.5072574 0.4102052 0.2443802
##
## Rotation (n x k) = (11 x 11):
## PC1 PC2 PC3 PC4
## fixed_acidity 0.497949147 -0.0774604042 0.07773493 -0.12761147
## volatile_acidity -0.228613756 0.2973557116 0.42725833 -0.12854199
## citric_acid 0.453908847 -0.1799547686 -0.23564528 -0.04506366
## density 0.414489403 0.2687629961 0.25839356 -0.17865109
## alcohol -0.094708706 -0.4167029020 -0.37845423 -0.33087315
## pH -0.409475084 -0.0009806387 -0.03806645 -0.15945585
## Log_residual_sugar 0.198616078 0.2082892990 -0.02938800 -0.70534037
## Log_chlorides 0.227309032 0.2157308428 0.20635906 0.40324174
## Log_free_sulfur_dioxide -0.074197539 0.4790706213 -0.48908782 0.01812011
## Log_total_sulfur_dioxide -0.004662316 0.5505085334 -0.39476662 0.01314934
## Log_sulphates 0.220597718 -0.0694794605 -0.32548381 0.37112255
## PC5 PC6 PC7 PC8
## fixed_acidity 0.20602872 -0.01298276 0.32851510 -0.2897369
## volatile_acidity -0.17298151 -0.29204526 0.60308842 -0.1819769
## citric_acid 0.08260043 -0.05040931 -0.15768865 -0.3605598
## density -0.04472090 0.41364082 0.07930408 -0.1990981
## alcohol -0.26306784 -0.40477572 0.21105461 -0.2724748
## pH -0.33063659 0.52406092 -0.18375767 -0.5565766
## Log_residual_sugar -0.37735390 -0.05607819 -0.22622011 0.4088799
## Log_chlorides -0.50419008 -0.43132228 -0.39929144 -0.2507839
## Log_free_sulfur_dioxide 0.11152021 -0.07612622 0.07572221 -0.1353631
## Log_total_sulfur_dioxide 0.14453935 -0.07108447 0.02253235 -0.0743601
## Log_sulphates -0.55709025 0.31997827 0.44949490 0.2744915
## PC9 PC10 PC11
## fixed_acidity 0.31604555 0.126404238 -0.611040342
## volatile_acidity -0.32007451 -0.213814616 0.005828197
## citric_acid -0.61379756 -0.394018864 0.088273875
## density 0.18509322 0.158370103 0.615559165
## alcohol 0.19703270 0.271013852 0.317142337
## pH -0.02868781 -0.001296862 -0.277713094
## Log_residual_sugar -0.04278776 -0.103130711 -0.205992070
## Log_chlorides 0.15996348 0.070066978 -0.059433915
## Log_free_sulfur_dioxide 0.43001182 -0.541972628 0.067211285
## Log_total_sulfur_dioxide -0.35634449 0.611908258 -0.086661149
## Log_sulphates -0.08777829 -0.028491471 -0.064760413
Numerically as well:
summary(prcomp_train)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7629 1.4398 1.2645 1.04309 0.98146 0.82598 0.7585
## Proportion of Variance 0.2825 0.1885 0.1454 0.09891 0.08757 0.06202 0.0523
## Cumulative Proportion 0.2825 0.4710 0.6163 0.71526 0.80283 0.86485 0.9172
## PC8 PC9 PC10 PC11
## Standard deviation 0.65273 0.50726 0.4102 0.24438
## Proportion of Variance 0.03873 0.02339 0.0153 0.00543
## Cumulative Proportion 0.95588 0.97927 0.9946 1.00000
We analyse the variances and the components more graphically:
prcomp_train.var <- prcomp_train$sdev^2
prcomp_train.pvar <- prcomp_train.var/sum(prcomp_train.var)
plot(cumsum(prcomp_train.pvar), xlab = "components", ylab = "cumulative variance",
ylim = c(0, 1), type = "b")
grid()
abline(h = 0.95, col = "blue")
plot(prcomp_train, type = "l", main = "Variance explained by PCA")
fviz_screeplot(prcomp_train, addlabels = TRUE)
As we can see, the first two components (PC1 and PC2) capture only 47.10% of the variability. With the first three (PC1, PC2 and PC3) the figure rises to 61.63%. This means that a plot of the wine data represented by the first two or three principal components will not be sufficiently representative. The cumulative-variance plot also shows that the first 8 PCs are needed to cover 95% of the variance of the dataset. It is hard to see the point of reducing only 11 variables to 8 PCs, given the loss of interpretability with respect to the original variables that this implies.
We plot the data projected onto the first two components:
ggplot(as.data.frame(prcomp(train[, -6], scale = T)$x[, 1:2]),
aes(x = PC1, y = PC2, label = rownames(train))) + geom_point() +
geom_text(hjust = 0, vjust = 0)
We try to incorporate the variable information using the “biplot” technique:
ggbiplot(prcomp(train[, -6], scale = T), labels = rownames(train))
ggbiplot(prcomp(train[, -6], scale = T), ellipse = TRUE, labels = rownames(train),
groups = train$quality)
train_fquality <- train %>%
mutate(quality = as.factor(quality))
ggbiplot(prcomp_train, obs.scale = 1, var.scale = 1, groups = train_fquality$quality,
ellipse = TRUE, circle = TRUE) + scale_color_discrete(name = "") +
theme(legend.direction = "horizontal", legend.position = "top")
We see that the analysis with only 2 components is not optimal, since by themselves they do not explain a high percentage of the variance. Even so, as an exploratory view of the data and of the possible groups, some pattern can be glimpsed: roughly speaking, the lower in the plot, the better the wines are rated (blue and pink points correspond to scores close to 7 and 8), and the higher up, the worse the rating (green, yellow and red points correspond to scores of 5 and below).
We extend the analysis using the first 4 principal components, to try to identify possible clearer groupings of the data.
colores <- function(vec) {
    # rainbow() returns a vector containing the required
    # number of distinct colours
    col <- rainbow(length(unique(vec)))
    return(col[as.numeric(as.factor(vec))])
}
par(mfrow = c(1, 2))
# Observations on PC1 and PC2
plot(prcomp_train$x[, 1:2], col = colores(train_fquality$quality),
pch = 19, xlab = "PC1", ylab = "PC2")
# Observations on PC1 and PC3
plot(prcomp_train$x[, c(1, 3)], col = colores(train_fquality$quality),
pch = 19, xlab = "PC1", ylab = "PC3")
# Observations on PC1 and PC4
plot(prcomp_train$x[, c(1, 4)], col = colores(train_fquality$quality),
pch = 19, xlab = "PC1", ylab = "PC4")
Using more components (extending the analysis to the third and fourth PCs) adds very little, and we see no clear or noteworthy groupings. This is because even with the 4 dimensions of the first 4 PCs we would explain barely 71.52% of the variance of the data.
In our train dataset we create the binary variable “nota_vino”, which, based on “quality”, flags wines with passing scores (quality >= 6, labelled “aprobado”) or failing scores (quality < 6, labelled “suspenso”).
train_pca <- train[, colnames(train) != "quality"]
train_pca$nota_vino <- factor(train$quality < 6, labels = c("aprobado",
    "suspenso"))  # FALSE (quality >= 6) -> 'aprobado'; TRUE (quality < 6) -> 'suspenso'
str(train_pca)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : Factor w/ 2 levels "aprobado","suspenso": 2 2 2 2 2 2 2 2 1 1 ...
table(train_pca$nota_vino)
##
## aprobado suspenso
## 682 597
train_pca
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <fct>
We now run the Principal Component Analysis exactly as before:
prcomp_train_2 <- prcomp(train_pca[, -12], center = TRUE, scale. = TRUE)
prcomp_train_2
## Standard deviations (1, .., p=11):
## [1] 1.7629487 1.4397808 1.2644531 1.0430903 0.9814598 0.8259786 0.7584904
## [8] 0.6527267 0.5072574 0.4102052 0.2443802
##
## Rotation (n x k) = (11 x 11):
## PC1 PC2 PC3 PC4
## fixed_acidity 0.497949147 -0.0774604042 0.07773493 -0.12761147
## volatile_acidity -0.228613756 0.2973557116 0.42725833 -0.12854199
## citric_acid 0.453908847 -0.1799547686 -0.23564528 -0.04506366
## density 0.414489403 0.2687629961 0.25839356 -0.17865109
## alcohol -0.094708706 -0.4167029020 -0.37845423 -0.33087315
## pH -0.409475084 -0.0009806387 -0.03806645 -0.15945585
## Log_residual_sugar 0.198616078 0.2082892990 -0.02938800 -0.70534037
## Log_chlorides 0.227309032 0.2157308428 0.20635906 0.40324174
## Log_free_sulfur_dioxide -0.074197539 0.4790706213 -0.48908782 0.01812011
## Log_total_sulfur_dioxide -0.004662316 0.5505085334 -0.39476662 0.01314934
## Log_sulphates 0.220597718 -0.0694794605 -0.32548381 0.37112255
## PC5 PC6 PC7 PC8
## fixed_acidity 0.20602872 -0.01298276 0.32851510 -0.2897369
## volatile_acidity -0.17298151 -0.29204526 0.60308842 -0.1819769
## citric_acid 0.08260043 -0.05040931 -0.15768865 -0.3605598
## density -0.04472090 0.41364082 0.07930408 -0.1990981
## alcohol -0.26306784 -0.40477572 0.21105461 -0.2724748
## pH -0.33063659 0.52406092 -0.18375767 -0.5565766
## Log_residual_sugar -0.37735390 -0.05607819 -0.22622011 0.4088799
## Log_chlorides -0.50419008 -0.43132228 -0.39929144 -0.2507839
## Log_free_sulfur_dioxide 0.11152021 -0.07612622 0.07572221 -0.1353631
## Log_total_sulfur_dioxide 0.14453935 -0.07108447 0.02253235 -0.0743601
## Log_sulphates -0.55709025 0.31997827 0.44949490 0.2744915
## PC9 PC10 PC11
## fixed_acidity 0.31604555 0.126404238 -0.611040342
## volatile_acidity -0.32007451 -0.213814616 0.005828197
## citric_acid -0.61379756 -0.394018864 0.088273875
## density 0.18509322 0.158370103 0.615559165
## alcohol 0.19703270 0.271013852 0.317142337
## pH -0.02868781 -0.001296862 -0.277713094
## Log_residual_sugar -0.04278776 -0.103130711 -0.205992070
## Log_chlorides 0.15996348 0.070066978 -0.059433915
## Log_free_sulfur_dioxide 0.43001182 -0.541972628 0.067211285
## Log_total_sulfur_dioxide -0.35634449 0.611908258 -0.086661149
## Log_sulphates -0.08777829 -0.028491471 -0.064760413
summary(prcomp_train_2)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7629 1.4398 1.2645 1.04309 0.98146 0.82598 0.7585
## Proportion of Variance 0.2825 0.1885 0.1454 0.09891 0.08757 0.06202 0.0523
## Cumulative Proportion 0.2825 0.4710 0.6163 0.71526 0.80283 0.86485 0.9172
## PC8 PC9 PC10 PC11
## Standard deviation 0.65273 0.50726 0.4102 0.24438
## Proportion of Variance 0.03873 0.02339 0.0153 0.00543
## Cumulative Proportion 0.95588 0.97927 0.9946 1.00000
ggbiplot(prcomp_train_2, obs.scale = 1, var.scale = 1, groups = train_pca$nota_vino,
ellipse = TRUE, circle = TRUE) + scale_color_discrete(name = "") +
theme(legend.direction = "horizontal", legend.position = "top")
The results are the same as before, with no improvement, but this way of displaying the data gives a clearer picture of what we meant. The points towards the bottom of the plot generally correspond to “passed” wines (pinkish points, quality of 6 or above), while those towards the top generally correspond to “failed” wines (bluish points, quality below 6). Beyond that, and with only 47.10% of the variance explained by the first 2 PCs, no further patterns or conclusions can be drawn from the data.
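To make this limitation visible at a glance, the explained variance per component can be drawn as a scree plot; a minimal sketch using fviz_eig from factoextra (already loaded), applied to the PCA object above:
# Sketch: percentage of variance explained by each principal component
fviz_eig(prcomp_train_2, addlabels = TRUE)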
We now try to reduce the dimension with a method that, unlike PCA, is not linear. The t-SNE algorithm can separate data that no straight line (or hyperplane) could separate, i.e. it can cope with data that are not linearly separable. It helps us make sense of high-dimensional data by projecting them down to only 2 or 3 dimensions.
tsne_train <- train[, -6]  # drop "quality" (column 6)
tsne_train
## # A tibble: 1,279 × 11
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 5 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>
The algorithm builds a probability distribution that represents the similarities between neighbouring points, and then tries to preserve those neighbourhoods while reducing the dimension.
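For reference, these are the standard t-SNE definitions (not computed explicitly in this report): in the original space the similarity between points x_i and x_j is the Gaussian conditional probability p(j|i) = exp(-||x_i - x_j||^2 / (2*sigma_i^2)) / sum_{k != i} exp(-||x_i - x_k||^2 / (2*sigma_i^2)), where each sigma_i is chosen so that the effective number of neighbours matches the perplexity (30 in our calls below). In the low-dimensional map the similarities follow a Student-t kernel, q_ij proportional to (1 + ||y_i - y_j||^2)^(-1), and the embedding minimises the Kullback-Leibler divergence between the two distributions.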
set.seed(3)
tsne_data <- tsne_train[, 1:11]
tsne <- Rtsne(tsne_data, check_duplicates = FALSE, perplexity = 30,
pca = FALSE, theta = 0.5, dims = 2, max_iter = 500, eta = 200,
epoch = 1000)
par(mfrow = c(1, 2))
plot(tsne$Y, col = "black", bg = train_fquality$quality, pch = 21,
cex = 1.5, main = "tSNE", xlab = "tSNE dimension 1", ylab = "tSNE dimension 2")
As with PCA, the results are not satisfactory, so applying these techniques to our dataset does not look worthwhile.
In our training dataset we create the binary variable “nota_vino”, which, based on “quality”, flags wines as passed (“aprobado”, quality >= 6) or failed (“suspenso”, quality < 6).
train_tsne <- train[, colnames(train) != "quality"]
train_tsne$nota_vino <- factor(train$quality < 6, labels = c("aprobado",
"suspenso")) # levels = c('FALSE', 'TRUE')
str(train_tsne)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : Factor w/ 2 levels "aprobado","suspenso": 2 2 2 2 2 2 2 2 1 1 ...
table(train_tsne$nota_vino)
##
## aprobado suspenso
## 682 597
train_tsne
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <fct>
set.seed(3)
tsne_data_2 <- train_tsne[, 1:11]
tsne_2 <- Rtsne(tsne_data_2, check_duplicates = FALSE, perplexity = 30,
pca = FALSE, theta = 0.5, dims = 2, max_iter = 500, eta = 200,
epoch = 1000)
par(mfrow = c(1, 2))
plot(tsne$Y, col = "black", bg = train_tsne$nota_vino, pch = 21,
cex = 1.5, main = "tSNE", xlab = "tSNE dimension 1", ylab = "tSNE dimension 2")
Binarising the response variable does not clarify much either, so no useful dimensionality reduction can be applied to our data.
First of all, we create the binary variable “nota_vino”, which, based on “quality”, flags wines as passed (quality >= 6, coded as “1”) or failed (quality < 6, coded as “0”).
train_glm <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_glm
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_glm$nota_vino)
##
## 0 1
## 597 682
str(train_glm)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
With this split into “passed” and “failed” wines, the distribution between the two groups is fairly balanced: 597 failed and 682 passed in the training data.
Next we look at the correlations and at how the variables behave with respect to this newly created categorical variable:
c <- cor(train_glm)
corrplot(c)
We show the correlations numerically:
round(c, 2)
## fixed_acidity volatile_acidity citric_acid density
## fixed_acidity 1.00 -0.26 0.68 0.68
## volatile_acidity -0.26 1.00 -0.55 0.02
## citric_acid 0.68 -0.55 1.00 0.37
## density 0.68 0.02 0.37 1.00
## alcohol -0.05 -0.21 0.15 -0.49
## pH -0.64 0.22 -0.49 -0.32
## Log_residual_sugar 0.20 0.04 0.19 0.44
## Log_chlorides 0.16 0.09 0.16 0.33
## Log_free_sulfur_dioxide -0.18 0.03 -0.11 -0.04
## Log_total_sulfur_dioxide -0.12 0.08 -0.03 0.11
## Log_sulphates 0.19 -0.30 0.32 0.14
## nota_vino 0.11 -0.32 0.18 -0.15
## alcohol pH Log_residual_sugar Log_chlorides
## fixed_acidity -0.05 -0.64 0.20 0.16
## volatile_acidity -0.21 0.22 0.04 0.09
## citric_acid 0.15 -0.49 0.19 0.16
## density -0.49 -0.32 0.44 0.33
## alcohol 1.00 0.18 0.06 -0.29
## pH 0.18 1.00 -0.10 -0.26
## Log_residual_sugar 0.06 -0.10 1.00 0.12
## Log_chlorides -0.29 -0.26 0.12 1.00
## Log_free_sulfur_dioxide -0.09 0.08 0.10 -0.02
## Log_total_sulfur_dioxide -0.24 -0.03 0.17 0.06
## Log_sulphates 0.13 -0.14 0.02 0.22
## nota_vino 0.45 -0.01 0.00 -0.14
## Log_free_sulfur_dioxide Log_total_sulfur_dioxide
## fixed_acidity -0.18 -0.12
## volatile_acidity 0.03 0.08
## citric_acid -0.11 -0.03
## density -0.04 0.11
## alcohol -0.09 -0.24
## pH 0.08 -0.03
## Log_residual_sugar 0.10 0.17
## Log_chlorides -0.02 0.06
## Log_free_sulfur_dioxide 1.00 0.79
## Log_total_sulfur_dioxide 0.79 1.00
## Log_sulphates 0.06 0.04
## nota_vino -0.06 -0.20
## Log_sulphates nota_vino
## fixed_acidity 0.19 0.11
## volatile_acidity -0.30 -0.32
## citric_acid 0.32 0.18
## density 0.14 -0.15
## alcohol 0.13 0.45
## pH -0.14 -0.01
## Log_residual_sugar 0.02 0.00
## Log_chlorides 0.22 -0.14
## Log_free_sulfur_dioxide 0.06 -0.06
## Log_total_sulfur_dioxide 0.04 -0.20
## Log_sulphates 1.00 0.28
## nota_vino 0.28 1.00
We analyse each variable against the new class in a bivariate way:
# nota_vino vs alcohol
train_glm %>%
ggplot(aes(x = alcohol, fill = factor(nota_vino))) + geom_density(alpha = 0.5)
# nota_vino vs Log_sulphates
train_glm %>%
ggplot(aes(x = Log_sulphates, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs volatile_acidity
train_glm %>%
ggplot(aes(x = volatile_acidity, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs density
train_glm %>%
ggplot(aes(x = density, fill = factor(nota_vino))) + geom_density(alpha = 0.5)
# nota_vino vs citric_acid
train_glm %>%
ggplot(aes(x = citric_acid, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs Log_total_sulfur_dioxide
train_glm %>%
ggplot(aes(x = Log_total_sulfur_dioxide, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
Broadly speaking, the wines in the “passed” category have a higher value of “alcohol”, slightly higher “Log_sulphates”, lower “volatile_acidity”, slightly lower “density”, higher “citric_acid” and lower “Log_total_sulfur_dioxide”.
# nota_vino vs fixed_acidity
train_glm %>%
ggplot(aes(x = fixed_acidity, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs Log_free_sulfur_dioxide
train_glm %>%
ggplot(aes(x = Log_free_sulfur_dioxide, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs Log_residual_sugar
train_glm %>%
ggplot(aes(x = Log_residual_sugar, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
# nota_vino vs pH
train_glm %>%
ggplot(aes(x = pH, fill = factor(nota_vino))) + geom_density(alpha = 0.5)
# nota_vino vs Log_chlorides
train_glm %>%
ggplot(aes(x = Log_chlorides, fill = factor(nota_vino))) +
geom_density(alpha = 0.5)
For the variables “Log_chlorides”, “pH”, “Log_residual_sugar”, “Log_free_sulfur_dioxide” and “fixed_acidity” it is harder to tell passed from failed wines in the density plots, since these are not defining characteristics of either group.
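The same comparison can also be condensed into a single faceted panel; a sketch using pivot_longer from tidyr and ggplot2 (both already loaded):
train_glm %>%
    pivot_longer(-nota_vino, names_to = "variable", values_to = "value") %>%
    ggplot(aes(x = value, fill = factor(nota_vino))) + geom_density(alpha = 0.5) +
    facet_wrap(~variable, scales = "free")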
We fit a logistic regression model on the variables of our dataset to serve as a predictor of the binary variable just created.
modelo_glm = glm(nota_vino ~ alcohol + fixed_acidity + volatile_acidity +
citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
Log_free_sulfur_dioxide + density + pH + Log_sulphates, data = train_glm,
family = binomial)
modelo_glm
##
## Call: glm(formula = nota_vino ~ alcohol + fixed_acidity + volatile_acidity +
## citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
## Log_free_sulfur_dioxide + density + pH + Log_sulphates, family = binomial,
## data = train_glm)
##
## Coefficients:
## (Intercept) alcohol fixed_acidity
## -24.7457 0.9700 0.1753
## volatile_acidity citric_acid Log_chlorides
## -3.2462 -1.7627 -0.4797
## Log_total_sulfur_dioxide Log_free_sulfur_dioxide density
## -0.6740 0.4413 17.5474
## pH Log_sulphates
## -0.1818 2.6310
##
## Degrees of Freedom: 1278 Total (i.e. Null); 1268 Residual
## Null Deviance: 1767
## Residual Deviance: 1300 AIC: 1322
summary(modelo_glm)
##
## Call:
## glm(formula = nota_vino ~ alcohol + fixed_acidity + volatile_acidity +
## citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
## Log_free_sulfur_dioxide + density + pH + Log_sulphates, family = binomial,
## data = train_glm)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3178 -0.8232 0.3053 0.7861 2.4429
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.74575 73.63188 -0.336 0.73682
## alcohol 0.96996 0.10386 9.340 < 2e-16 ***
## fixed_acidity 0.17526 0.09475 1.850 0.06436 .
## volatile_acidity -3.24621 0.54263 -5.982 2.20e-09 ***
## citric_acid -1.76266 0.60585 -2.909 0.00362 **
## Log_chlorides -0.47972 0.24746 -1.939 0.05256 .
## Log_total_sulfur_dioxide -0.67404 0.17254 -3.907 9.36e-05 ***
## Log_free_sulfur_dioxide 0.44134 0.16690 2.644 0.00818 **
## density 17.54736 74.95795 0.234 0.81491
## pH -0.18179 0.73995 -0.246 0.80593
## Log_sulphates 2.63103 0.39252 6.703 2.04e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1767.4 on 1278 degrees of freedom
## Residual deviance: 1300.2 on 1268 degrees of freedom
## AIC: 1322.2
##
## Number of Fisher Scoring iterations: 4
As the output shows, we keep only the significant variables that really affect “nota_vino” and build a new model exclusively with them (“Log_sulphates”, “Log_total_sulfur_dioxide”, “volatile_acidity” and “alcohol”). This simplifies the model, retains only the variables that matter for prediction, and gives us the best logistic regression model we can build for this dataset.
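As a cross-check of this manual selection, an automated backward search by AIC could be run on the full model; a sketch using stepAIC from MASS (already loaded; its selection does not have to coincide exactly with ours):
# Hypothetical cross-check: backward elimination by AIC from the full model
modelo_step <- stepAIC(modelo_glm, direction = "backward", trace = FALSE)
formula(modelo_step)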
modelo_glm2 = glm(nota_vino ~ alcohol + volatile_acidity + Log_sulphates +
Log_total_sulfur_dioxide, data = train_glm, family = binomial)
modelo_glm2
##
## Call: glm(formula = nota_vino ~ alcohol + volatile_acidity + Log_sulphates +
## Log_total_sulfur_dioxide, family = binomial, data = train_glm)
##
## Coefficients:
## (Intercept) alcohol volatile_acidity
## -5.9714 0.9694 -2.7563
## Log_sulphates Log_total_sulfur_dioxide
## 2.2830 -0.3912
##
## Degrees of Freedom: 1278 Total (i.e. Null); 1274 Residual
## Null Deviance: 1767
## Residual Deviance: 1329 AIC: 1339
summary(modelo_glm2)
##
## Call:
## glm(formula = nota_vino ~ alcohol + volatile_acidity + Log_sulphates +
## Log_total_sulfur_dioxide, family = binomial, data = train_glm)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0757 -0.8344 0.2981 0.8035 2.3837
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.97144 0.98206 -6.081 1.20e-09 ***
## alcohol 0.96938 0.07907 12.260 < 2e-16 ***
## volatile_acidity -2.75634 0.41634 -6.620 3.58e-11 ***
## Log_sulphates 2.28302 0.35200 6.486 8.82e-11 ***
## Log_total_sulfur_dioxide -0.39120 0.10198 -3.836 0.000125 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1767.4 on 1278 degrees of freedom
## Residual deviance: 1328.6 on 1274 degrees of freedom
## AIC: 1338.6
##
## Number of Fisher Scoring iterations: 4
To interpret the coefficients we exponentiate them, obtaining odds ratios with their confidence intervals:
round(exp(cbind(Estimate = coef(modelo_glm2), confint(modelo_glm2))),
2)
## Estimate 2.5 % 97.5 %
## (Intercept) 0.00 0.00 0.02
## alcohol 2.64 2.27 3.09
## volatile_acidity 0.06 0.03 0.14
## Log_sulphates 9.81 4.97 19.79
## Log_total_sulfur_dioxide 0.68 0.55 0.83
These confidence intervals are not based on a Wald test (as in ordinary regression) but on profiling the log-likelihood, which is more accurate.
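The difference can be checked directly: for a glm, confint() profiles the likelihood while confint.default() returns the Wald intervals; a small sketch comparing both for the alcohol coefficient:
round(exp(confint.default(modelo_glm2)["alcohol", ]), 2)  # Wald interval
round(exp(confint(modelo_glm2)["alcohol", ]), 2)  # profile-likelihood interval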
Model predictions (on the link scale, i.e. log-odds):
head(predict(modelo_glm2))
## 1 2 3 4 5 6
## 0.1559899 -1.5792806 -1.4338215 -0.6054977 -0.6622377 -1.5725449
Probabilities on the response scale:
head(predict(modelo_glm2, type = "response"))
## 1 2 3 4 5 6
## 0.5389186 0.1708974 0.1925039 0.3530869 0.3402371 0.1718539
We evaluate the predictive performance of the GLM model on the training data:
train_glm$y_pred_probs <- predict(modelo_glm2, train_glm, type = "response")
train_glm$y_pred <- ifelse(train_glm$y_pred_probs > 0.5, 1, 0)
# train_glm$y_pred_probs train_glm$y_pred
cm_train <- confusionMatrix(as.factor(train_glm$y_pred), as.factor(train_glm$nota_vino),
positive = "1")
cm_train$table
## Reference
## Prediction 0 1
## 0 445 172
## 1 152 510
# result
cm_train$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.75
cm_train$byClass["Recall"] %>%
round(2)
## Recall
## 0.75
cm_train$byClass["Precision"] %>%
round(2)
## Precision
## 0.77
Looking at the metrics obtained, Accuracy (correct predictions / total predictions) is 75%, Precision (true positives / (true positives + false positives)) is 77%, and Recall or Sensitivity (true positives / (true positives + false negatives)) is 75%.
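These figures can be recomputed by hand from the confusion matrix above (TP = 510, TN = 445, FP = 152, FN = 172):
TP <- 510; TN <- 445; FP <- 152; FN <- 172
(TP + TN) / (TP + TN + FP + FN)  # accuracy, ~0.75
TP / (TP + FP)  # precision, ~0.77
TP / (TP + FN)  # recall, ~0.75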
We therefore expect the model to predict correctly, in about 75% of cases, whether a wine passes because it is reasonably good (quality >= 6) or fails because it is poor (quality < 6).
We now try to apply cross-validation to the GLM model and to run a hyperparameter search:
We first check which parameters the model exposes for tuning. As can be seen, the GLM model has no hyperparameters to adjust.
## https://machinelearningmastery.com/how-to-estimate-model-accuracy-in-r-using-the-caret-package/
## https://daviddalpiaz.github.io/r4sl/the-caret-package.html#classification
# Tunable hyperparameters for this model
modelLookup("glm")
## model parameter label forReg forClass probModel
## 1 glm parameter parameter TRUE TRUE TRUE
We fit the model with the predictors selected as relevant, using 5-fold cross-validation on the training dataset.
caret.glm <- train(as.factor(nota_vino) ~ alcohol + volatile_acidity + Log_sulphates + Log_total_sulfur_dioxide,
method = "glm",
family = "binomial",
data = train_glm,
trControl = trainControl(method = "cv", number = 5))
caret.glm
## Generalized Linear Model
##
## 1279 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1024, 1024, 1023, 1022, 1023
## Resampling results:
##
## Accuracy Kappa
## 0.7482917 0.4956541
summary(caret.glm)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0757 -0.8344 0.2981 0.8035 2.3837
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.97144 0.98206 -6.081 1.20e-09 ***
## alcohol 0.96938 0.07907 12.260 < 2e-16 ***
## volatile_acidity -2.75634 0.41634 -6.620 3.58e-11 ***
## Log_sulphates 2.28302 0.35200 6.486 8.82e-11 ***
## Log_total_sulfur_dioxide -0.39120 0.10198 -3.836 0.000125 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1767.4 on 1278 degrees of freedom
## Residual deviance: 1328.6 on 1274 degrees of freedom
## AIC: 1338.6
##
## Number of Fisher Scoring iterations: 4
We therefore expect the model to predict correctly, in about 74-75% of cases, whether a wine passes (quality >= 6) or fails (quality < 6).
confusionMatrix(caret.glm)
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1
## 0 35.1 13.6
## 1 11.6 39.7
##
## Accuracy (average) : 0.7482
We evaluate the predictive performance of the cross-validated GLM model on the training data:
train_glm$y_pred_probs2 <- predict(caret.glm, train_glm, type = "prob")
# the two class probabilities sum to 1, so this simply keeps P(class "1")
train_glm$y_pred_probs2 <- ifelse(train_glm$y_pred_probs2$`1` > 0.5,
    train_glm$y_pred_probs2$`1`, 1 - train_glm$y_pred_probs2$`0`)
train_glm$y_pred2 <- ifelse(train_glm$y_pred_probs2 > 0.5, 1, 0)
# train_glm$y_pred_probs2 train_glm$y_pred2
We reproduce the confusion matrix and evaluation metrics for the final GLM model obtained:
cm_train2 <- confusionMatrix(as.factor(train_glm$y_pred2), as.factor(train_glm$nota_vino),
positive = "1")
cm_train2$table
## Reference
## Prediction 0 1
## 0 445 172
## 1 152 510
# result
cm_train2$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.75
cm_train2$byClass["Recall"] %>%
round(2)
## Recall
## 0.75
cm_train2$byClass["Precision"] %>%
round(2)
## Precision
## 0.77
We plot the ROC curve for the final GLM model obtained:
roc_glm <- plot.roc(as.numeric(train_glm$nota_vino), as.numeric(train_glm$y_pred_probs2))
auc(roc_glm)
## Area under the curve: 0.8195
The area under the curve is about 82%.
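The 0.5 cut-off used above is not the only possible choice; pROC can suggest an operating threshold maximising Youden's J statistic (sensitivity + specificity - 1). A sketch on the ROC object just computed:
# Sketch: threshold that maximises sensitivity + specificity - 1
coords(roc_glm, x = "best", best.method = "youden",
    ret = c("threshold", "sensitivity", "specificity"))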
In our training dataset we create the binary variable “nota_vino”, which, based on “quality”, flags wines as passed (quality >= 6) or failed (quality < 6).
train_knn <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_knn
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_knn$nota_vino)
##
## 0 1
## 597 682
str(train_knn)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
# REFERENCE: https://www.edureka.co/blog/knn-algorithm-in-r/
# train_knn <- train[, colnames(train) != 'quality']
# train_knn$nota_vino <- factor(train$quality < 6, labels =
# c('aprobado','suspenso')) # levels = c('FALSE', 'TRUE')
# train_knn table(train_knn$nota_vino)
# str(train_knn)
First of all we count the observations in our training dataset; its square root is a common rule of thumb for an initial number of neighbours “K”, which we will later refine towards the optimal value.
n_train <- NROW(train_knn)  # avoid masking the NROW() function
n_train
## [1] 1279
To approximate the optimal “K” we take the square root of the total number of training observations:
sqrt(n_train)
## [1] 35.76311
As a first approximation to the optimal “k” of a knn model, we will try 35 and 36 neighbours.
To build the knn model we split our training data into train and validation sets:
numero_total = nrow(train_knn)
w_train = 0.7  # 70% of the rows for training
w_vali = 0.3   # 30% for validation
indices = seq_len(numero_total)
# sample() truncates the non-integer size (1279 * 0.7 = 895.3 -> 895 rows)
indices_train = sample(1:numero_total, numero_total * w_train)
indices_vali = sample(indices[-indices_train], numero_total * w_vali)
k_train = train_knn[indices_train, ]
k_train
## # A tibble: 895 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10.4 0.44 0.73 0.999 12 3.17
## 2 11.7 0.28 0.47 0.997 10.6 3.15
## 3 10.7 0.4 0.37 0.997 11.2 3.12
## 4 7.9 0.34 0.36 0.994 11.2 3.3
## 5 6.1 0.58 0.23 0.994 12.5 3.46
## 6 11.5 0.59 0.59 0.999 11 3.18
## 7 6.8 0.59 0.1 0.996 9.7 3.3
## 8 7.2 0.53 0.13 0.996 9.9 3.21
## 9 7.7 0.75 0.27 0.997 9.3 3.24
## 10 6.9 0.84 0.21 0.998 9.23 3.53
## # … with 885 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
k_vali = train_knn[indices_vali, ]
We try a simple k-nearest-neighbours model with K = 35 and K = 36:
knn_simple_35 <- knn(k_train[, 1:11], k_vali[, 1:11], k = 35,
cl = as.factor(k_train$nota_vino))
knn_simple_36 <- knn(k_train[, 1:11], k_vali[, 1:11], k = 36,
cl = as.factor(k_train$nota_vino))
table(knn_simple_35, as.factor(k_vali$nota_vino))
##
## knn_simple_35 0 1
## 0 141 58
## 1 42 142
table(knn_simple_36, as.factor(k_vali$nota_vino))
##
## knn_simple_36 0 1
## 0 139 57
## 1 44 143
accuracy_knn_simple_35 = sum(knn_simple_35 == k_vali$nota_vino) / nrow(k_vali)
accuracy_knn_simple_36 = sum(knn_simple_36 == k_vali$nota_vino) / nrow(k_vali)
error_knn_simple_35 = 1 - accuracy_knn_simple_35
error_knn_simple_36 = 1 - accuracy_knn_simple_36
accuracy_knn_simple_35
## [1] 0.7389034
error_knn_simple_35
## [1] 0.2610966
print("...........................")
## [1] "..........................."
accuracy_knn_simple_36
## [1] 0.7362924
error_knn_simple_36
## [1] 0.2637076
In both cases the results are very similar, with an accuracy of around 73-74%.
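Before handing the tuning over to caret in the next step, the same search could be done manually over a grid of k on the held-out validation split; a sketch assuming k_train and k_vali from above:
ks <- seq(1, 101, by = 2)
acc_vali <- sapply(ks, function(k) {
    pred <- knn(k_train[, 1:11], k_vali[, 1:11], cl = as.factor(k_train$nota_vino), k = k)
    mean(pred == k_vali$nota_vino)
})
plot(ks, acc_vali, type = "b", xlab = "k", ylab = "validation accuracy")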
Having seen the baseline model, we go a step further with a version that normalises or standardises the variables so they are treated on an equivalent scale, tunes the hyperparameter more precisely by choosing the optimal “k” via cross-validation, and evaluates the model's accuracy, robustness and ability to generalise.
modelLookup("knn")
## model parameter label forReg forClass probModel
## 1 knn k #Neighbors TRUE TRUE TRUE
The parameter we can tune is “k”, the number of nearest neighbours against which each observation is compared; classification is carried out according to the Euclidean distance between points.
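For intuition, that distance can be computed directly between two observations; a minimal sketch (on scaled variables, so that no single variable dominates):
x_scaled <- scale(train_knn[, 1:11])  # centre and scale the predictors
sqrt(sum((x_scaled[1, ] - x_scaled[2, ])^2))  # same value as dist(x_scaled[1:2, ])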
We set up a knn model with a 5-fold cross-validation process that centres and scales the variables and tunes the optimal number of neighbours k.
set.seed(22222220)
caret.knn = train(as.factor(nota_vino) ~ ., data = train_knn,
method = "knn", trControl = trainControl(method = "cv", number = 5),
preProcess = c("center", "scale"), tuneGrid = expand.grid(k = seq(1,
101, by = 2)))
caret.knn
## k-Nearest Neighbors
##
## 1279 samples
## 11 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1023, 1023, 1023, 1023, 1024
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.7529259 0.5024638
## 3 0.7271293 0.4496676
## 5 0.7451256 0.4857637
## 7 0.7279075 0.4514521
## 9 0.7372702 0.4712542
## 11 0.7497855 0.4961405
## 13 0.7474357 0.4921884
## 15 0.7443045 0.4863425
## 17 0.7419700 0.4817875
## 19 0.7403860 0.4788672
## 21 0.7403922 0.4788520
## 23 0.7482108 0.4949287
## 25 0.7443076 0.4870466
## 27 0.7435202 0.4857680
## 29 0.7404013 0.4793634
## 31 0.7443107 0.4873547
## 33 0.7443168 0.4876106
## 35 0.7466575 0.4922032
## 37 0.7474357 0.4937484
## 39 0.7474387 0.4936762
## 41 0.7435294 0.4852850
## 43 0.7419669 0.4824611
## 45 0.7396201 0.4774906
## 47 0.7458732 0.4903228
## 49 0.7435263 0.4853138
## 51 0.7419669 0.4826133
## 53 0.7505576 0.4996890
## 55 0.7505607 0.4998633
## 57 0.7497794 0.4980862
## 59 0.7536918 0.5055618
## 61 0.7560386 0.5102242
## 63 0.7560417 0.5104059
## 65 0.7583885 0.5153964
## 67 0.7544761 0.5079165
## 69 0.7599510 0.5183972
## 71 0.7599540 0.5181401
## 73 0.7560355 0.5106282
## 75 0.7536918 0.5056467
## 77 0.7576011 0.5138357
## 79 0.7560509 0.5110409
## 81 0.7544761 0.5080859
## 83 0.7552574 0.5095624
## 85 0.7552543 0.5096503
## 87 0.7529136 0.5050209
## 89 0.7576072 0.5145375
## 91 0.7599540 0.5192917
## 93 0.7552635 0.5098494
## 95 0.7552727 0.5096173
## 97 0.7591850 0.5178204
## 99 0.7607445 0.5210021
## 101 0.7583946 0.5161423
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 99.
summary(caret.knn)
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 11 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
The optimal k is 99 neighbours, which gives an accuracy of 76.07%, improving on the earlier results.
get_best_result = function(caret_fit) {
    # pick the row of caret's results table that matches the best tuning values
    best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
    best_result = caret_fit$results[best, ]
    rownames(best_result) = NULL
    best_result
}
get_best_result(caret.knn)
## k Accuracy Kappa AccuracySD KappaSD
## 1 99 0.7607445 0.5210021 0.02308807 0.04514979
We therefore expect the model to predict correctly, in about 76-77% of cases, whether a wine passes (quality >= 6) or fails (quality < 6).
We evaluate the predictive performance of the KNN model on the training data:
train_knn$y_pred_probs2 <- predict(caret.knn, newdata = train_knn,
type = "prob")
train_knn$y_pred_probs2 <- ifelse(train_knn$y_pred_probs2$`1` >
0.5, train_knn$y_pred_probs2$`1`, 1 - train_knn$y_pred_probs2$`0`)
train_knn$y_pred2 <- ifelse(train_knn$y_pred_probs2 > 0.5, 1,
0)
head(train_knn$y_pred_probs2)
## [1] 0.4242424 0.3030303 0.5353535 0.3737374 0.4747475 0.3737374
train_knn
## # A tibble: 1,279 × 14
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 8 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>,
## # y_pred_probs2 <dbl>, y_pred2 <dbl>
# train_knn$y_pred2
We reproduce the confusion matrix and evaluation metrics for the final KNN model obtained:
cm_train_knn <- confusionMatrix(as.factor(train_knn$y_pred2),
as.factor(train_knn$nota_vino), positive = "1")
cm_train_knn$table
## Reference
## Prediction 0 1
## 0 463 164
## 1 134 518
# result
cm_train_knn$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.77
cm_train_knn$byClass["Recall"] %>%
round(2)
## Recall
## 0.76
cm_train_knn$byClass["Precision"] %>%
round(2)
## Precision
## 0.79
We plot the ROC curve for the final KNN model obtained:
roc_knn <- plot.roc(as.numeric(train_knn$nota_vino), as.numeric(train_knn$y_pred_probs2))
auc(roc_knn)
## Area under the curve: 0.8268
The area under the curve is about 82-83%.
In our training dataset we create the binary variable “nota_vino”, which, based on “quality”, flags wines as passed (quality >= 6) or failed (quality < 6).
# train_tree <- train[, colnames(train)!='quality']
# train_tree$nota_vino <- factor(train$quality < 6, labels
# = c('aprobado', 'suspenso')) train_tree str(train_tree)
train_tree <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_tree
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_tree$nota_vino)
##
## 0 1
## 597 682
str(train_tree)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
We build an initial, basic, unpruned decision tree:
# classification tree with "gini" splits; we lower cp to 0.006 (default is 0.01) to grow a slightly deeper tree
tree = rpart(as.factor(nota_vino) ~ ., data = train_tree, cp = 0.006)
rpart.plot(tree, nn = TRUE, extra = 104, box.palette = "GnBu", branch.lty = 3, shadow.col = "gray")
tree
## n= 1279
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1279 597 1 (0.4667709 0.5332291)
## 2) alcohol< 10.525 779 280 0 (0.6405648 0.3594352)
## 4) Log_sulphates< -0.4700356 494 129 0 (0.7388664 0.2611336) *
## 5) Log_sulphates>=-0.4700356 285 134 1 (0.4701754 0.5298246)
## 10) volatile_acidity>=0.545 103 33 0 (0.6796117 0.3203883)
## 20) Log_chlorides>=-2.333097 36 4 0 (0.8888889 0.1111111) *
## 21) Log_chlorides< -2.333097 67 29 0 (0.5671642 0.4328358)
## 42) fixed_acidity< 9.55 59 21 0 (0.6440678 0.3559322) *
## 43) fixed_acidity>=9.55 8 0 1 (0.0000000 1.0000000) *
## 11) volatile_acidity< 0.545 182 64 1 (0.3516484 0.6483516)
## 22) Log_total_sulfur_dioxide>=4.166635 42 17 0 (0.5952381 0.4047619) *
## 23) Log_total_sulfur_dioxide< 4.166635 140 39 1 (0.2785714 0.7214286) *
## 3) alcohol>=10.525 500 98 1 (0.1960000 0.8040000)
## 6) volatile_acidity>=0.87 24 6 0 (0.7500000 0.2500000) *
## 7) volatile_acidity< 0.87 476 80 1 (0.1680672 0.8319328) *
We analyse the resulting rules numerically:
rpart.rules(tree, style = "tall")
## as.factor(nota_vino) is 0.11 when
## alcohol < 11
## volatile_acidity >= 0.55
## Log_sulphates >= -0.47
## Log_chlorides >= -2.3
##
## as.factor(nota_vino) is 0.25 when
## alcohol >= 11
## volatile_acidity >= 0.87
##
## as.factor(nota_vino) is 0.26 when
## alcohol < 11
## Log_sulphates < -0.47
##
## as.factor(nota_vino) is 0.36 when
## alcohol < 11
## volatile_acidity >= 0.55
## Log_sulphates >= -0.47
## Log_chlorides < -2.3
## fixed_acidity < 9.6
##
## as.factor(nota_vino) is 0.40 when
## alcohol < 11
## volatile_acidity < 0.55
## Log_sulphates >= -0.47
## Log_total_sulfur_dioxide >= 4.2
##
## as.factor(nota_vino) is 0.72 when
## alcohol < 11
## volatile_acidity < 0.55
## Log_sulphates >= -0.47
## Log_total_sulfur_dioxide < 4.2
##
## as.factor(nota_vino) is 0.83 when
## alcohol >= 11
## volatile_acidity < 0.87
##
## as.factor(nota_vino) is 1.00 when
## alcohol < 11
## volatile_acidity >= 0.55
## Log_sulphates >= -0.47
## Log_chlorides < -2.3
## fixed_acidity >= 9.6
obs_tree1 <- as.factor(train_tree$nota_vino)
head(predict(tree, newdata = train_tree))
## 0 1
## 1 0.7388664 0.2611336
## 2 0.7388664 0.2611336
## 3 0.7388664 0.2611336
## 4 0.5952381 0.4047619
## 5 0.6440678 0.3559322
## 6 0.7388664 0.2611336
pred_tree1 <- predict(tree, newdata = train_tree, type = "class")
table(obs_tree1, pred_tree1)
## pred_tree1
## obs_tree1 0 1
## 0 478 119
## 1 177 505
caret::confusionMatrix(pred_tree1, obs_tree1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 478 177
## 1 119 505
##
## Accuracy : 0.7686
## 95% CI : (0.7445, 0.7914)
## No Information Rate : 0.5332
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5379
##
## Mcnemar's Test P-Value : 0.0009228
##
## Sensitivity : 0.8007
## Specificity : 0.7405
## Pos Pred Value : 0.7298
## Neg Pred Value : 0.8093
## Prevalence : 0.4668
## Detection Rate : 0.3737
## Detection Prevalence : 0.5121
## Balanced Accuracy : 0.7706
##
## 'Positive' Class : 0
##
The model's accuracy is 76.86%, with the drawback that it is unpruned, overly complex and may tend to overfit.
We assess whether pruning could simplify the model and make it more interpretable without losing predictive power. For this we look at the CP or “complexity parameter”, seeking the shallowest tree that also minimises the error rate.
plotcp(tree)  # CP - complexity parameter: shallowest tree that also minimises the cv error
printcp(tree)
##
## Classification tree:
## rpart(formula = as.factor(nota_vino) ~ ., data = train_tree,
## cp = 0.006)
##
## Variables actually used in tree construction:
## [1] alcohol fixed_acidity Log_chlorides
## [4] Log_sulphates Log_total_sulfur_dioxide volatile_acidity
##
## Root node error: 597/1279 = 0.46677
##
## n= 1279
##
## CP nsplit rel error xerror xstd
## 1 0.3668342 0 1.00000 1.00000 0.029886
## 2 0.0452261 1 0.63317 0.65159 0.027559
## 3 0.0201005 3 0.54271 0.58291 0.026660
## 4 0.0134003 4 0.52261 0.56951 0.026464
## 5 0.0067002 5 0.50921 0.56951 0.026464
## 6 0.0060000 7 0.49581 0.56114 0.026339
Finally we decide to prune, building a simpler alternative model. We apply the one-standard-error rule: we keep the smallest tree whose cross-validated error is within one standard error of the minimum.
xerror <- tree$cptable[, "xerror"]  # cross-validated error for each subtree
imin.xerror <- which.min(xerror)  # subtree with the minimum xerror
# upper bound: minimum xerror plus one standard error
upper.xerror <- xerror[imin.xerror] + tree$cptable[imin.xerror, "xstd"]
icp <- min(which(xerror <= upper.xerror))  # smallest tree under that bound
cp <- tree$cptable[icp, "CP"]
cp
## [1] 0.0201005
tree_2 <- prune(tree, cp = cp)
# tree summary(tree) caret::varImp(tree) importance <-
# tree$variable.importance importance <-
# round(100*importance/sum(importance), 1)
# importance[importance >= 1]
rpart.plot(tree_2, nn = TRUE, extra = 104, box.palette = "GnBu",
branch.lty = 3, shadow.col = "gray") #, main='Classification tree winetaste'
obs_tree2 <- as.factor(train_tree$nota_vino)
head(predict(tree_2, newdata = train_tree))
## 0 1
## 1 0.7388664 0.2611336
## 2 0.7388664 0.2611336
## 3 0.7388664 0.2611336
## 4 0.3516484 0.6483516
## 5 0.6796117 0.3203883
## 6 0.7388664 0.2611336
pred_tree2 <- predict(tree_2, newdata = train_tree, type = "class")
table(obs_tree2, pred_tree2)
## pred_tree2
## obs_tree2 0 1
## 0 435 162
## 1 162 520
caret::confusionMatrix(pred_tree2, obs_tree2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 435 162
## 1 162 520
##
## Accuracy : 0.7467
## 95% CI : (0.7219, 0.7703)
## No Information Rate : 0.5332
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4911
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7286
## Specificity : 0.7625
## Pos Pred Value : 0.7286
## Neg Pred Value : 0.7625
## Prevalence : 0.4668
## Detection Rate : 0.3401
## Detection Prevalence : 0.4668
## Balanced Accuracy : 0.7456
##
## 'Positive' Class : 0
##
Pruning gives us a cleaner, simpler, more interpretable model that should generalise to other data, avoiding possible overfitting while only reducing accuracy to 74.67%. We take this pruned model to be the optimal one in this case.
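To see which predictors drive the pruned tree, rpart's variable importance can be inspected; a sketch along the lines of the commented-out code above:
importance <- tree_2$variable.importance
round(100 * importance / sum(importance), 1)  # relative importance, in %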
To obtain the best possible model we run 5-fold cross-validation and tune hyperparameters (the optimal “cp” for a validated model). We also use the variables identified above as the most representative and explanatory of the response “nota_vino”.
# Fit the model on the training set
set.seed(1234)
caret.tree <- train(as.factor(nota_vino) ~ alcohol + volatile_acidity +
Log_sulphates + Log_total_sulfur_dioxide, data = train_tree,
method = "rpart", trControl = trainControl("cv", number = 5),
tuneLength = 20)
# Plot model accuracy vs different values of cp (complexity
# parameter)
plot(caret.tree)
caret.tree
## CART
##
## 1279 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1022, 1024, 1023, 1023, 1024
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.00000000 0.7239386 0.4455300
## 0.01930706 0.7255101 0.4484141
## 0.03861412 0.7215885 0.4421371
## 0.05792118 0.7004855 0.4069132
## 0.07722825 0.7004855 0.4069132
## 0.09653531 0.7004855 0.4069132
## 0.11584237 0.7004855 0.4069132
## 0.13514943 0.7004855 0.4069132
## 0.15445649 0.7004855 0.4069132
## 0.17376355 0.7004855 0.4069132
## 0.19307062 0.7004855 0.4069132
## 0.21237768 0.7004855 0.4069132
## 0.23168474 0.7004855 0.4069132
## 0.25099180 0.7004855 0.4069132
## 0.27029886 0.7004855 0.4069132
## 0.28960592 0.7004855 0.4069132
## 0.30891299 0.7004855 0.4069132
## 0.32822005 0.7004855 0.4069132
## 0.34752711 0.7004855 0.4069132
## 0.36683417 0.6614230 0.3137864
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01930706.
caret.tree$bestTune
## cp
## 2 0.01930706
Cross-validation shows that the optimal cp for our decision-tree model is 0.01930706.
We visualize the resulting tree:
# Plot the final tree model
par(xpd = NA) # Avoid clipping the text in some device
plot(caret.tree$finalModel, uniform = TRUE)
text(caret.tree$finalModel, digits = 10)
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
get_best_result(caret.tree)
## cp Accuracy Kappa AccuracySD KappaSD
## 1 0.01930706 0.7255101 0.4484141 0.02542086 0.05203991
With cross-validation we finally obtain an accuracy of roughly 72-73%, with a model that has been verified to be robust and should generalize predictably to a different data set.
Evaluation of the predictive performance of the Decision Tree model on the training data:
train_tree$y_pred_probs2 <- predict(caret.tree, newdata = train_tree,
type = "prob")
train_tree$y_pred_probs2 <- ifelse(train_tree$y_pred_probs2$`1` >
0.5, train_tree$y_pred_probs2$`1`, 1 - train_tree$y_pred_probs2$`0`)
train_tree$y_pred2 <- ifelse(train_tree$y_pred_probs2 > 0.5,
1, 0)
# train_tree$y_pred_probs2 train_tree$y_pred2
We reproduce the confusion matrix and the evaluation metrics for the final Decision Tree model:
cm_train_tree <- confusionMatrix(as.factor(train_tree$y_pred2),
as.factor(train_tree$nota_vino), positive = "1")
cm_train_tree$table
## Reference
## Prediction 0 1
## 0 453 168
## 1 144 514
# result
cm_train_tree$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.76
cm_train_tree$byClass["Recall"] %>%
round(2)
## Recall
## 0.75
cm_train_tree$byClass["Precision"] %>%
round(2)
## Precision
## 0.78
We reproduce the ROC curve for the final Decision Tree model:
roc_tree <- plot.roc(as.numeric(train_tree$nota_vino), as.numeric(train_tree$y_pred_probs2))
auc(roc_tree)
## Area under the curve: 0.7798
The area under the curve is around 78%.
In our training data set we create the binary variable "nota_vino" which, based on "quality", flags wines with passing scores (quality >= 6) or failing scores (quality < 6).
train_forest <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_forest
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_forest$nota_vino)
##
## 0 1
## 597 682
str(train_forest)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
# train_forest <- train[, colnames(train)!='quality']
# train_forest$nota_vino <- factor(train$quality < 6,
# labels = c('aprobado', 'suspenso')) train_forest
# str(train_forest)
We build the base random forest model:
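The chunk that fits this base model does not appear in the report; a minimal sketch, assuming randomForest() defaults (the seed is an arbitrary choice) and enabling the accuracy-based importance used by varImpPlot() below, would be:
set.seed(1234)
# base fit on all predictors of train_forest; importance = TRUE stores the
# permutation (accuracy-based) importance measures
rf <- randomForest(as.factor(nota_vino) ~ ., data = train_forest,
    importance = TRUE)
rf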
We examine how the error converges across the samples:
plot(rf, main = "")
legend("right", colnames(rf$err.rate), lty = 1:5, col = 1:6)
We look at the relevance of the variables in the model (the key variable with the greatest effect on the model's accuracy is "alcohol"):
varImpPlot(rf)
We see that the main parameter to tune is the number of predictors the model samples at random.
modelLookup("rf")
## model parameter label forReg forClass probModel
## 1 rf mtry #Randomly Selected Predictors TRUE TRUE TRUE
We build a model using cross-validation and hyperparameter tuning (mtry, the number of trees, and the node size that controls tree depth) so as to obtain a robust, generalizable model. We start from the 4 most relevant variables observed above:
# Fit the model on the training set
set.seed(12345)
caret.rf <- train(as.factor(nota_vino) ~ alcohol + volatile_acidity +
Log_sulphates + Log_total_sulfur_dioxide, data = train_forest,
method = "rf", ntree = 100, importance = TRUE, metric = "Accuracy",
trControl = trainControl("cv", number = 5, search = "grid"),
nodesize = 50, tuneLength = 10)
## note: only 3 unique complexity parameters in default grid. Truncating the grid to 3 .
plot(caret.rf)
caret.rf
## Random Forest
##
## 1279 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1023, 1022, 1023, 1024, 1024
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7599436 0.5182558
## 3 0.7575968 0.5141214
## 4 0.7474160 0.4941438
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
caret.rf$bestTune
## mtry
## 1 2
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
get_best_result(caret.rf)
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.7599436 0.5182558 0.02457311 0.04918626
Evaluation of the predictive performance of the Random Forest model on the training data:
train_forest$y_pred_probs2 <- predict(caret.rf, newdata = train_forest,
type = "prob")
train_forest$y_pred_probs2 <- ifelse(train_forest$y_pred_probs2$`1` >
0.5, train_forest$y_pred_probs2$`1`, 1 - train_forest$y_pred_probs2$`0`)
train_forest$y_pred2 <- ifelse(train_forest$y_pred_probs2 > 0.5,
1, 0)
# train_forest$y_pred_probs2
train_forest$y_pred2
## [1] 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1 0 0 1
## [38] 0 1 0 1 0 1 0 1 1 1 0 0 0 0 0 1 1 1 0 1 1 0 0 1 1 1 1 1 0 0 1 0 1 1 1 0 0
## [75] 0 1 0 1 1 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 1 1 1 1 0 1 0 0 0 1 1 0 0 1 1
## [112] 1 0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 1 1 0 0
## [149] 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 0 1 0 0 1 0 0 0 1 1 1 0 1 0 0 1 1 0 0 1 1 1
## [186] 1 1 1 1 0 0 1 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 0 1 1 1 1 0 1 1 1
## [223] 0 0 1 0 0 1 1 1 1 1 1 0 0 0 0 1 1 1 0 1 0 1 1 0 0 1 1 0 1 1 1 1 0 1 0 1 1
## [260] 0 0 1 0 0 0 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 0 0 0 1 1 1 0 1 0 1
## [297] 1 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 1 0 1 1 0 0
## [334] 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 1 0 0 0 1 1 1 0 1 0 1 1 1 0
## [371] 0 0 1 0 1 0 0 1 1 0 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 1 0 0 1 1 1 0 1 1 1
## [408] 1 1 0 0 0 0 0 0 0 0 1 1 1 0 1 1 0 1 0 0 0 0 0 1 1 1 1 1 0 0 1 0 0 1 0 1 1
## [445] 1 1 1 0 1 1 1 1 0 1 0 1 0 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 1 0 1 1 1 0
## [482] 0 0 1 0 0 1 1 0 0 0 1 1 0 1 0 0 0 0 1 0 1 1 0 0 1 1 0 0 0 1 1 1 1 0 0 0 1
## [519] 0 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1
## [556] 1 0 1 1 0 1 0 1 0 1 1 0 1 0 1 1 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 1 0 0 1
## [593] 0 0 1 1 0 0 0 0 1 1 1 1 0 0 1 0 1 0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 0 0 0 1
## [630] 1 1 0 1 1 1 1 0 1 0 1 1 0 0 1 0 1 1 0 1 1 0 1 0 0 0 1 1 1 0 0 1 1 1 1 0 1
## [667] 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1 0 1 1 0 0 0 0
## [704] 1 1 0 1 0 0 1 0 0 0 1 1 1 1 0 1 1 1 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
## [741] 1 0 1 0 1 1 0 0 0 1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 1 0
## [778] 1 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 1 0 0 0 0 1 0 0
## [815] 0 0 1 1 0 0 1 0 0 0 1 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 0 1 1
## [852] 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1 1 0 1 1 1 1 1 0 1 0 0 0 1 1 0 1 0 0 1 1 0 0
## [889] 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 1 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1
## [926] 1 1 1 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 0 1 1 1 0 0 0 1 0 0 0 1 1 0 1
## [963] 0 0 1 1 1 1 0 0 1 0 0 1 0 1 1 0 0 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0
## [1000] 1 0 1 0 0 0 0 0 0 1 1 0 1 0 1 1 1 1 0 1 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1
## [1037] 1 1 0 0 0 0 1 1 0 1 1 0 1 1 1 0 0 1 1 1 1 1 0 0 1 0 0 1 1 0 0 0 1 1 1 1 1
## [1074] 0 0 1 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 1 1 0 1 0 1 0 1 1 0
## [1111] 0 0 0 1 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0
## [1148] 0 1 1 0 0 0 1 1 1 0 0 1 0 0 1 0 1 0 1 1 0 1 0 1 0 0 0 1 1 1 0 1 1 0 1 0 0
## [1185] 0 0 0 1 1 1 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 0 1 0 1 0 0 1 0 0 0 1 1 1
## [1222] 0 0 0 1 0 1 1 1 0 0 1 0 0 1 1 1 0 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 0 1 1 1 1
## [1259] 1 1 0 0 1 1 0 0 1 1 1 1 1 1 0 1 0 1 0 0 1
train_forest
## # A tibble: 1,279 × 14
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 8 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>,
## # y_pred_probs2 <dbl>, y_pred2 <dbl>
We reproduce the confusion matrix and the evaluation metrics for the final Random Forest model:
cm_train_forest <- confusionMatrix(as.factor(train_forest$y_pred2),
as.factor(train_forest$nota_vino), positive = "1")
cm_train_forest$table
## Reference
## Prediction 0 1
## 0 495 131
## 1 102 551
# result
cm_train_forest$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.82
cm_train_forest$byClass["Recall"] %>%
round(2)
## Recall
## 0.81
cm_train_forest$byClass["Precision"] %>%
round(2)
## Precision
## 0.84
Reproducimos la curva ROC sobre el modelo final de Random Forest obtenido:
roc_rf <- plot.roc(as.numeric(train_forest$nota_vino), as.numeric(train_forest$y_pred_probs2))
auc(roc_rf)
## Area under the curve: 0.9017
Checking the model on the test data:
We now validate the predictive power of our random forest model on the test data set. First of all, we recreate the binary variable "nota_vino" on the test set.
test_forest <- test %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
test_forest
## # A tibble: 320 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 0.998 9.4 3.51
## 2 7.3 0.65 0 0.995 10 3.39
## 3 8.9 0.22 0.48 0.997 9.4 3.39
## 4 7.6 0.41 0.24 0.996 9.5 3.28
## 5 7.1 0.71 0 0.997 9.4 3.47
## 6 5.7 1.13 0.09 0.994 9.8 3.5
## 7 7.3 0.45 0.36 0.998 10.5 3.33
## 8 8.1 0.66 0.22 0.997 10.3 3.3
## 9 6.8 0.67 0.02 0.996 9.5 3.48
## 10 5.6 0.31 0.37 0.995 9.2 3.32
## # … with 310 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(test_forest$nota_vino)
##
## 0 1
## 147 173
str(test_forest)
## tibble [320 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:320] 7.4 7.3 8.9 7.6 7.1 5.7 7.3 8.1 6.8 5.6 ...
## $ volatile_acidity : num [1:320] 0.7 0.65 0.22 0.41 0.71 1.13 0.45 0.66 0.67 0.31 ...
## $ citric_acid : num [1:320] 0 0 0.48 0.24 0 0.09 0.36 0.22 0.02 0.37 ...
## $ density : num [1:320] 0.998 0.995 0.997 0.996 0.997 ...
## $ alcohol : num [1:320] 9.4 10 9.4 9.5 9.4 9.8 10.5 10.3 9.5 9.2 ...
## $ pH : num [1:320] 3.51 3.39 3.39 3.28 3.47 3.5 3.33 3.3 3.48 3.32 ...
## $ Log_residual_sugar : num [1:320] 0.642 0.182 0.588 0.588 0.642 ...
## $ Log_chlorides : num [1:320] -2.58 -2.73 -2.56 -2.53 -2.53 ...
## $ Log_free_sulfur_dioxide : num [1:320] 2.4 2.71 3.37 1.39 2.64 ...
## $ Log_total_sulfur_dioxide: num [1:320] 3.53 3.04 4.09 2.4 3.56 ...
## $ Log_sulphates : num [1:320] -0.58 -0.755 -0.635 -0.528 -0.598 ...
## $ nota_vino : num [1:320] 0 1 1 0 0 0 0 0 0 0 ...
test_forest$y_pred_probs2 <- predict(caret.rf, newdata = test_forest,
type = "prob")
test_forest$y_pred_probs2 <- ifelse(test_forest$y_pred_probs2$`1` >
0.5, test_forest$y_pred_probs2$`1`, 1 - test_forest$y_pred_probs2$`0`)
test_forest$y_pred2 <- ifelse(test_forest$y_pred_probs2 > 0.5,
1, 0)
# test_forest$y_pred_probs2
test_forest$y_pred2
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0
## [38] 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 0 0 1 0 0
## [75] 0 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 0 1 0 0 0 1 1 1 1 1 0 1 1 1 1 1 0 0
## [112] 0 1 0 1 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1
## [149] 0 0 0 0 0 0 0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 0 0 0 1 1 1 1 1 1 1 1 1
## [186] 1 0 0 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 0 0 1 1 0 0 0 0 1 1 0 0 0 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 1 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0
## [260] 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 1 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 0
## [297] 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 0 1
test_forest
## # A tibble: 320 × 14
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 0.998 9.4 3.51
## 2 7.3 0.65 0 0.995 10 3.39
## 3 8.9 0.22 0.48 0.997 9.4 3.39
## 4 7.6 0.41 0.24 0.996 9.5 3.28
## 5 7.1 0.71 0 0.997 9.4 3.47
## 6 5.7 1.13 0.09 0.994 9.8 3.5
## 7 7.3 0.45 0.36 0.998 10.5 3.33
## 8 8.1 0.66 0.22 0.997 10.3 3.3
## 9 6.8 0.67 0.02 0.996 9.5 3.48
## 10 5.6 0.31 0.37 0.995 9.2 3.32
## # … with 310 more rows, and 8 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>,
## # y_pred_probs2 <dbl>, y_pred2 <dbl>
We reproduce the confusion matrix and the evaluation metrics for the final Random Forest model on the test data:
cm_test_forest <- confusionMatrix(as.factor(test_forest$y_pred2),
as.factor(test_forest$nota_vino), positive = "1")
cm_test_forest$table
## Reference
## Prediction 0 1
## 0 112 55
## 1 35 118
# result
cm_test_forest$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.72
cm_test_forest$byClass["Recall"] %>%
round(2)
## Recall
## 0.68
cm_test_forest$byClass["Precision"] %>%
round(2)
## Precision
## 0.77
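The ROC curve was only reported on the training data; a minimal sketch of the same computation on the test predictions (output not shown here) would be:
roc_rf_test <- plot.roc(as.numeric(test_forest$nota_vino),
    as.numeric(test_forest$y_pred_probs2))
auc(roc_rf_test)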
In our training data set we create the binary variable "nota_vino" which, based on "quality", flags wines with passing scores (quality >= 6) or failing scores (quality < 6).
# https://rubenfcasal.github.io/aprendizaje_estadistico/boosting-en-r.html
train_en <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_en
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_en$nota_vino)
##
## 0 1
## 597 682
str(train_en)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
# train_en <- train[, colnames(train)!='quality']
# train_en$nota_vino <- factor(train$quality < 6, labels =
# c('aprobado', 'suspenso')) # levels = c('FALSE', 'TRUE')
# str(train_en)
We build the boosting model with a basic initial parameter configuration:
ada.boost <- ada(as.factor(nota_vino) ~ ., data = train_en, type = "real",
control = rpart.control(maxdepth = 2, cp = 0, minsplit = 10, xval = 0),
iter = 150, nu = 0.05)
ada.boost
## Call:
## ada(as.factor(nota_vino) ~ ., data = train_en, type = "real",
## control = rpart.control(maxdepth = 2, cp = 0, minsplit = 10,
## xval = 0), iter = 150, nu = 0.05)
##
## Loss: exponential Method: real Iteration: 150
##
## Final Confusion Matrix for Data:
## Final Prediction
## True value 0 1
## 0 481 116
## 1 151 531
##
## Train Error: 0.209
##
## Out-Of-Bag Error: 0.223 iteration= 120
##
## Additional Estimates of number of iterations:
##
## train.err1 train.kap1
## 145 147
We can see how the error decreases as the number of iterations of the model increases:
plot(ada.boost)
We evaluate the model's accuracy on the training sample:
set.seed(123)
pred_ada <- predict(ada.boost, newdata = train_en)
caret::confusionMatrix(pred_ada, as.factor(train_en$nota_vino), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 481 151
## 1 116 531
##
## Accuracy : 0.7912
## 95% CI : (0.7679, 0.8132)
## No Information Rate : 0.5332
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5822
##
## Mcnemar's Test P-Value : 0.03746
##
## Sensitivity : 0.7786
## Specificity : 0.8057
## Pos Pred Value : 0.8207
## Neg Pred Value : 0.7611
## Prevalence : 0.5332
## Detection Rate : 0.4152
## Detection Prevalence : 0.5059
## Balanced Accuracy : 0.7921
##
## 'Positive' Class : 1
##
With this parameter configuration the ada boosting model achieves an accuracy of roughly 78-79% on this classification task.
To improve the model's results and its ability to generalize, we can tune its hyperparameters with cross-validation:
modelLookup("ada")
## model parameter label forReg forClass probModel
## 1 ada iter #Trees FALSE TRUE TRUE
## 2 ada maxdepth Max Tree Depth FALSE TRUE TRUE
## 3 ada nu Learning Rate FALSE TRUE TRUE
These are the parameters "iter", "maxdepth" and "nu" of the ada boosting model for decision trees in classification problems.
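The fit below uses caret's default grid, which only varies iter and maxdepth; a hedged sketch of an explicit grid that also varies the learning rate nu (the grid values and the name caret.ada.grid are arbitrary illustrative choices) would be:
ada_grid <- expand.grid(iter = c(50, 100, 150),
    maxdepth = c(1, 2, 3), nu = c(0.05, 0.1))
set.seed(123)
caret.ada.grid <- train(as.factor(nota_vino) ~ ., data = train_en,
    method = "ada", tuneGrid = ada_grid,
    trControl = trainControl(method = "cv", number = 5))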
set.seed(123)
caret.ada <- train(as.factor(nota_vino) ~ ., method = "ada", data = train_en,
trControl = trainControl(method = "cv", number = 5))
caret.ada
## Boosted Classification Trees
##
## 1279 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1024, 1023, 1022, 1024, 1023
## Resampling results across tuning parameters:
##
## maxdepth iter Accuracy Kappa
## 1 50 0.7443213 0.4890732
## 1 100 0.7584084 0.5160080
## 1 150 0.7576149 0.5133790
## 2 50 0.7568276 0.5121128
## 2 100 0.7529335 0.5043760
## 2 150 0.7591927 0.5167245
## 3 50 0.7599647 0.5189156
## 3 100 0.7607643 0.5206920
## 3 150 0.7568488 0.5125098
##
## Tuning parameter 'nu' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were iter = 100, maxdepth = 3 and nu = 0.1.
We obtain the optimal hyperparameter configuration for the model at "iter" = 100, "maxdepth" = 3 and "nu" = 0.1.
caret.ada$bestTune
## iter maxdepth nu
## 8 100 3 0.1
With this model we obtain an accuracy of 80.3% on the training data.
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
get_best_result(caret.ada)
## nu maxdepth iter Accuracy Kappa AccuracySD KappaSD
## 1 0.1 3 100 0.7607643 0.520692 0.02106294 0.04217624
Evaluation of the predictive performance of the Ada Boost model on the training data:
train_en$y_pred_probs2 <- predict(caret.ada, newdata = train_en,
type = "prob")
train_en$y_pred_probs2 <- ifelse(train_en$y_pred_probs2$`1` >
0.5, train_en$y_pred_probs2$`1`, 1 - train_en$y_pred_probs2$`0`)
train_en$y_pred2 <- ifelse(train_en$y_pred_probs2 > 0.5, 1, 0)
# train_forest$y_pred_probs2
train_en$y_pred2
## [1] 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 0 1
## [38] 0 1 0 1 0 1 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 1 1 0 0
## [75] 0 1 0 1 1 0 0 1 1 1 0 1 0 0 0 1 0 1 1 1 0 0 1 1 1 1 0 1 0 0 0 1 1 0 0 1 1
## [112] 1 0 0 0 0 1 0 1 0 0 1 0 0 1 1 0 0 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 1 1 0 0
## [149] 1 1 0 0 1 1 0 0 1 1 1 1 1 0 1 0 1 0 0 1 0 0 0 1 1 1 0 1 0 0 1 1 0 0 1 1 1
## [186] 1 1 0 1 0 1 1 0 1 0 1 1 1 1 1 0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 1 1 1 0 1 1 1
## [223] 0 0 1 0 0 1 1 1 1 1 1 0 0 1 0 1 1 1 0 1 0 1 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1
## [260] 0 0 1 0 0 0 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 0 0 0 1 1 1 0 1 0 1
## [297] 1 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 0 0 1 1 0 1 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0
## [334] 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 1 1 0
## [371] 1 0 1 0 1 0 0 1 1 0 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 1 0 0 0 1 1 1 1 1 1
## [408] 1 1 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 1 1 1 1 1 1 0 1 0 0 1 0 1 1
## [445] 1 1 1 0 1 1 1 1 0 1 0 1 0 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0
## [482] 0 0 1 0 0 1 1 0 0 0 1 1 1 0 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 0 1
## [519] 0 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1
## [556] 0 0 1 0 0 0 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 1 0 0 1
## [593] 1 0 0 1 1 0 0 0 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1
## [630] 1 1 0 1 1 1 1 0 1 0 1 1 0 1 1 0 1 1 0 1 0 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1
## [667] 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 1 0 1 1 0 1 0 0 0 1 1 1 1 1 0 1 1 0 0 0 0
## [704] 1 0 0 1 0 0 1 0 0 0 1 0 1 1 0 1 1 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0
## [741] 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 1 1
## [778] 1 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 1 1 0 1 0 0 1 0 0
## [815] 1 0 1 1 0 0 1 0 0 0 1 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1
## [852] 0 0 1 1 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 0 1 0 0 0 1 1 0 1 0 0 1 1 0 0
## [889] 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1
## [926] 1 1 1 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 0 1 1 1 0 0 0 1 0 0 1 1 1 0 1
## [963] 0 0 1 1 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0
## [1000] 1 0 1 0 0 0 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1
## [1037] 1 1 0 0 0 0 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 0 0 1 0 0 1 1 0 1 0 1 1 1 1 1
## [1074] 1 0 1 0 1 0 1 1 0 0 1 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0 1 1 1 0 1 0 1 0 1 1 0
## [1111] 1 0 0 1 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0
## [1148] 0 1 1 0 0 0 1 0 1 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 0 0 0 1 1 1 0 1 1 0 1 1 1
## [1185] 0 0 0 1 1 1 0 1 1 1 0 1 0 0 0 0 1 1 1 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 1 1 1
## [1222] 0 0 0 1 0 0 1 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 0 1 1 1 1
## [1259] 1 1 0 0 1 1 0 0 1 1 1 1 1 1 0 1 0 1 0 0 1
train_en
## # A tibble: 1,279 × 14
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 8 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>,
## # y_pred_probs2 <dbl>, y_pred2 <dbl>
We reproduce the confusion matrix and the evaluation metrics for the final Ada Boost model:
cm_train_en <- confusionMatrix(as.factor(train_en$y_pred2), as.factor(train_en$nota_vino),
positive = "1")
cm_train_en$table
## Reference
## Prediction 0 1
## 0 479 135
## 1 118 547
# result
cm_train_en$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.8
cm_train_en$byClass["Recall"] %>%
round(2)
## Recall
## 0.8
cm_train_en$byClass["Precision"] %>%
round(2)
## Precision
## 0.82
We reproduce the ROC curve for the final Ada Boost model:
roc_ada <- plot.roc(as.numeric(train_en$nota_vino), as.numeric(train_en$y_pred_probs2))
auc(roc_ada)
## Area under the curve: 0.8912
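As with the random forest, the boosting model could also be scored on the test set; a minimal sketch mirroring the test preparation used above (test_en and pred_ada_test are illustrative names; output not shown) would be:
test_en <- test %>%
    mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
    mutate(quality = NULL)
pred_ada_test <- predict(caret.ada, newdata = test_en)
confusionMatrix(pred_ada_test, as.factor(test_en$nota_vino), positive = "1")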
In our training data set we create the binary variable "nota_vino" which, based on "quality", flags wines with passing scores (quality >= 6) or failing scores (quality < 6).
train_xgb <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_xgb
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_xgb$nota_vino)
##
## 0 1
## 597 682
str(train_xgb)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
# train_xgb <- train[, colnames(train)!='quality']
# train_xgb$nota_vino <- factor(train$quality < 6, labels =
# c('aprobado', 'suspenso')) # levels = c('FALSE', 'TRUE')
# str(train_xgb)
To optimize the model's results, we can tune its hyperparameters with cross-validation:
modelLookup("xgbTree")
## model parameter label forReg forClass
## 1 xgbTree nrounds # Boosting Iterations TRUE TRUE
## 2 xgbTree max_depth Max Tree Depth TRUE TRUE
## 3 xgbTree eta Shrinkage TRUE TRUE
## 4 xgbTree gamma Minimum Loss Reduction TRUE TRUE
## 5 xgbTree colsample_bytree Subsample Ratio of Columns TRUE TRUE
## 6 xgbTree min_child_weight Minimum Sum of Instance Weight TRUE TRUE
## 7 xgbTree subsample Subsample Percentage TRUE TRUE
## probModel
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
## 7 TRUE
We build the boosting model with a given initial parameter configuration:
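The chunk that fits caret.xgb does not appear in the report; a minimal sketch, assuming a 5-fold cross-validation setup consistent with the other models (the seed is an arbitrary choice), would be:
set.seed(1234)
caret.xgb <- train(as.factor(nota_vino) ~ ., data = train_xgb,
    method = "xgbTree",
    trControl = trainControl("cv", number = 5))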
We obtain the optimal hyperparameter configuration for the model:
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
get_best_result(caret.xgb)
## eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
## 1 0.3 3 0 0.6 1 1 150
## Accuracy Kappa AccuracySD KappaSD
## 1 0.7967279 0.5917263 0.0189888 0.03670645
We look at the relevance of each variable in the model:
varImp(caret.xgb)
## xgbTree variable importance
##
## Overall
## alcohol 100.000
## Log_sulphates 65.538
## volatile_acidity 52.167
## density 39.356
## Log_total_sulfur_dioxide 39.330
## Log_chlorides 27.923
## pH 12.213
## citric_acid 4.803
## Log_free_sulfur_dioxide 4.607
## fixed_acidity 2.691
## Log_residual_sugar 0.000
Evaluation of the predictive performance of the XGBoost model on the training data:
train_xgb$y_pred_probs2 <- predict(caret.xgb, newdata = train_xgb,
type = "prob")
train_xgb$y_pred_probs2 <- ifelse(train_xgb$y_pred_probs2$`1` >
0.5, train_xgb$y_pred_probs2$`1`, 1 - train_xgb$y_pred_probs2$`0`)
train_xgb$y_pred2 <- ifelse(train_xgb$y_pred_probs2 > 0.5, 1,
0)
# train_forest$y_pred_probs2
train_xgb$y_pred2
## [1] 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 0 1 1 0 1 1 1 0 1 1 1 0 1 1
## [38] 1 1 1 1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 1 0 0 1 0 1 1 0 0 0
## [75] 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 1 1 0 0 1 1
## [112] 0 1 0 1 1 1 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 0
## [149] 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1 1 0 0 0 1 1 0 0 1 0 1 1 0 1 0 1 1 1
## [186] 1 1 1 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0 1 1 0 0 0 1 1 1 0 0 1 1 1
## [223] 0 0 1 0 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 1 1 1 1 1 1 0 1 1
## [260] 0 1 1 0 1 0 0 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 0 1 0 1
## [297] 1 1 0 0 1 0 0 1 1 1 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0
## [334] 1 1 0 1 1 0 0 0 0 1 1 1 1 1 1 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0
## [371] 0 0 1 0 1 1 0 1 1 0 0 0 0 1 1 1 1 0 0 1 1 1 0 1 0 0 0 1 0 0 1 1 1 1 1 0 1
## [408] 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 0 1 1 0 1 1 1 1 0 1 0 1 0 0 1 0 1 1
## [445] 1 1 0 0 1 1 1 1 0 1 0 1 0 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1 0 0 1 1 0
## [482] 0 0 1 0 0 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 1
## [519] 0 1 1 1 0 0 0 1 0 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 0 1 0 1
## [556] 0 0 1 0 0 1 0 1 0 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 1 0 1 1 0 0 1 1 1 1 0 0
## [593] 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 1 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1
## [630] 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 0 1
## [667] 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0 0 0 0
## [704] 1 1 0 1 0 1 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0
## [741] 1 0 1 0 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 0 1 0 0 0 1 1
## [778] 1 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 1 1 0 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 0 1
## [815] 0 0 1 1 0 0 1 0 0 0 1 0 0 1 1 1 1 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1
## [852] 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 0 1 0 0 1 1 0 0
## [889] 0 0 1 0 1 0 1 0 1 1 0 0 0 0 0 1 1 0 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 0 0 1
## [926] 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 1 0 1 1 1 1 0 0 0 1 1 1 0 1 1 0 0
## [963] 1 0 1 1 0 1 0 0 1 0 0 0 0 1 1 0 0 0 0 1 1 0 1 1 1 1 1 1 0 0 1 1 0 0 0 1 1
## [1000] 1 0 1 0 0 0 0 0 0 1 1 0 1 0 1 0 1 1 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 0 0 1
## [1037] 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 0 0 1 1 0 0 1 0 1 1 1 0 1 0 1 1 1 1 1
## [1074] 1 0 0 0 1 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0
## [1111] 0 1 0 1 0 1 1 1 1 1 1 1 0 1 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0
## [1148] 0 0 1 0 0 1 1 0 1 0 0 1 0 0 1 0 1 0 1 1 0 1 0 1 0 0 0 1 1 1 0 1 1 0 1 0 0
## [1185] 0 0 0 1 1 1 0 1 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 1 0 1 0 1 1 0 0 0 1 1 1
## [1222] 0 0 0 1 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 1 0 1 1 1 1 0 1 1 1 1
## [1259] 1 1 0 1 1 1 1 0 0 1 1 1 1 1 0 1 0 0 0 0 1
train_xgb
## # A tibble: 1,279 × 14
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 8 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>,
## # y_pred_probs2 <dbl>, y_pred2 <dbl>
We reproduce the confusion matrix and the evaluation metrics for the final XGBoost model:
cm_train_xgb <- confusionMatrix(as.factor(train_xgb$y_pred2),
as.factor(train_xgb$nota_vino), positive = "1")
cm_train_xgb$table
## Reference
## Prediction 0 1
## 0 584 15
## 1 13 667
# result
cm_train_xgb$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.98
cm_train_xgb$byClass["Recall"] %>%
round(2)
## Recall
## 0.98
cm_train_xgb$byClass["Precision"] %>%
round(2)
## Precision
## 0.98
We reproduce the ROC curve. Note that, as written, the chunk below recomputes the curve on the Ada Boost predictions (train_en) rather than on the XGBoost ones; see the sketch after the output.
roc_ada <- plot.roc(as.numeric(train_en$nota_vino), as.numeric(train_en$y_pred_probs2))
auc(roc_ada)
## Area under the curve: 0.8912
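A minimal sketch of the equivalent computation on the XGBoost training predictions (roc_xgb is an illustrative name; output not shown here) would be:
roc_xgb <- plot.roc(as.numeric(train_xgb$nota_vino),
    as.numeric(train_xgb$y_pred_probs2))
auc(roc_xgb)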
We use the kernlab package to build our SVM algorithm and train the model with the train() function from the caret package.
In our train and test data sets we have created the binary variable "nota_vino" which, based on "quality", flags wines with passing scores (quality >= 6) or failing scores (quality < 6).
#train_svm <- train[, colnames(train)!="quality"]
#train_svm$nota_vino <- factor(train$quality < 6, labels = c('aprobado', 'suspenso')) # levels = c('FALSE', 'TRUE')
#train_svm
train_svm <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_svm
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_svm$nota_vino)
##
## 0 1
## 597 682
str(train_svm)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
modelLookup('svmLinear')
## model parameter label forReg forClass probModel
## 1 svmLinear C Cost TRUE TRUE TRUE
modelLookup('svmRadial')
## model parameter label forReg forClass probModel
## 1 svmRadial sigma Sigma TRUE TRUE TRUE
## 2 svmRadial C Cost TRUE TRUE TRUE
set.seed(5555)
# Fit the model
caret.svm <- train(as.factor(nota_vino) ~., data = train_svm,
method = "svmLinear",
trControl = trainControl("cv", number = 5, search = "grid"),
preProcess = c("center","scale"),
tuneGrid = expand.grid(C = seq(0.1, 2, length = 20)))
#View the model
caret.svm
## Support Vector Machines with Linear Kernel
##
## 1279 samples
## 11 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1023, 1024, 1022, 1024, 1023
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.1 0.7521827 0.5055023
## 0.2 0.7545265 0.5098180
## 0.3 0.7545295 0.5098578
## 0.4 0.7537452 0.5082373
## 0.5 0.7529609 0.5066244
## 0.6 0.7545234 0.5097268
## 0.7 0.7545234 0.5097268
## 0.8 0.7545234 0.5097268
## 0.9 0.7560890 0.5129537
## 1.0 0.7560890 0.5129537
## 1.1 0.7560890 0.5129537
## 1.2 0.7553047 0.5114252
## 1.3 0.7553047 0.5114252
## 1.4 0.7553047 0.5114252
## 1.5 0.7553047 0.5114252
## 1.6 0.7553047 0.5114252
## 1.7 0.7553047 0.5114252
## 1.8 0.7553047 0.5114252
## 1.9 0.7545204 0.5098984
## 2.0 0.7553047 0.5114252
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.9.
plot(caret.svm)
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
get_best_result(caret.svm)
## C Accuracy Kappa AccuracySD KappaSD
## 1 0.9 0.756089 0.5129537 0.03569907 0.06917993
train_svm$y_pred_probs2 <- predict(caret.svm, newdata = train_svm)
# train_svm$y_pred_probs2 <-
# ifelse(train_svm$y_pred_probs2$`1` > 0.5,
# train_svm$y_pred_probs2$`1`,
# 1-train_svm$y_pred_probs2$`0`)
# train_svm$y_pred2 <- ifelse(train_svm$y_pred_probs2 >
# 0.5, 1, 0)
train_svm$y_pred_probs2
## [1] 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 0 1 1 0 0 1
## [38] 0 1 0 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 1 1 0 0
## [75] 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 0 1 0 0 1 0 1 0 0 1 1
## [112] 1 0 0 0 0 1 0 0 0 0 1 0 0 1 1 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1 1 1 0 1 1 0 0
## [149] 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 0 1 0 0 1 0 0 0 1 1 1 0 1 0 0 1 1 0 0 0 1 1
## [186] 1 1 0 1 0 0 1 0 1 0 0 1 1 1 1 0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 1 1 1 0 1 1 1
## [223] 0 1 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1 0 0 1 0 1 1 0 0 1 1 0 1 1 1 1 0 1 0 1 1
## [260] 0 0 0 1 0 0 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0 0 1 0 1
## [297] 1 1 0 0 1 0 0 1 1 1 1 1 0 1 0 1 0 0 1 1 1 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 0
## [334] 0 0 1 0 1 1 0 0 0 1 0 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 0 1 0
## [371] 0 0 1 0 1 0 0 1 1 0 0 0 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 1 1 1 0 1
## [408] 1 1 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 1 1 1 1 1 0 0 1 0 0 1 0 1 1
## [445] 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 0 1 1 0 0 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0
## [482] 0 0 1 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 1
## [519] 0 1 0 1 0 0 0 1 1 0 1 0 1 0 0 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 0 1 1 1
## [556] 0 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 1 0 0 0 0 1 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1
## [593] 1 0 0 1 1 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 1 1 0 0 1 0 0 0 1
## [630] 1 1 0 1 1 1 1 0 1 0 1 1 0 0 0 0 1 1 0 1 0 0 1 1 0 0 1 0 1 0 0 1 1 1 1 0 1
## [667] 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 1 1 1 1 0 1 0 0
## [704] 1 0 0 1 0 0 1 0 0 1 1 1 1 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [741] 1 0 1 0 0 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 0 1 1 1 1 0 1 0 0 1 0 1 1 0 0 1 0
## [778] 1 0 0 1 1 1 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 1 0 1 1 1 1 0 1 0 0 0 0 0 1 0 0
## [815] 0 0 1 1 0 0 0 0 0 0 1 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 1
## [852] 0 0 1 1 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 1 0 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 0
## [889] 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 1 0 1 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1
## [926] 1 1 1 0 1 1 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 0 1 1 1 0 0 0 1 0 0 0 1 1 0 1
## [963] 0 0 1 1 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 0
## [1000] 1 0 1 0 0 0 0 1 0 1 1 0 1 0 1 1 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1
## [1037] 1 1 0 0 0 0 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 1 1 1
## [1074] 0 0 1 0 1 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 1 1 1 1 0 1 0 1 0 1 1 1
## [1111] 1 0 0 1 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [1148] 0 1 1 0 0 0 1 0 1 1 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 0 0 1 1 0 0 1 1 0 1 0 0
## [1185] 0 0 0 1 1 1 0 1 1 1 0 1 0 0 0 0 1 1 1 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 1 1 1
## [1222] 0 0 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1
## [1259] 1 1 0 0 1 0 0 0 1 1 1 1 1 1 0 1 0 1 0 0 0
## Levels: 0 1
# train_svm train_svm$y_pred2
We reproduce the confusion matrix and the evaluation metrics for the final SVM model:
cm_train_svm <- confusionMatrix(as.factor(train_svm$y_pred_probs2),
as.factor(train_svm$nota_vino), positive = "1")
cm_train_svm$table
## Reference
## Prediction 0 1
## 0 465 181
## 1 132 501
# result
cm_train_svm$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.76
cm_train_svm$byClass["Recall"] %>%
round(2)
## Recall
## 0.73
cm_train_svm$byClass["Precision"] %>%
round(2)
## Precision
## 0.79
We reproduce the ROC curve for the final linear SVM model:
roc_svm <- plot.roc(as.numeric(train_svm$nota_vino), as.numeric(train_svm$y_pred_probs2))
auc(roc_svm)
## Area under the curve: 0.7567
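The commented-out lines above expect class probabilities, but predict() on this fit returns hard class labels, which is also why the ROC above is built from 0/1 predictions. A hedged sketch of how probabilities could be obtained instead (caret requires syntactically valid class names when classProbs = TRUE; train_svm_p and caret.svm.p are illustrative names, and the leftover prediction column is dropped first):
train_svm_p <- train_svm[, setdiff(names(train_svm), "y_pred_probs2")]
train_svm_p$nota_vino <- factor(train_svm_p$nota_vino,
    levels = c(0, 1), labels = c("suspenso", "aprobado"))
set.seed(5555)
caret.svm.p <- train(nota_vino ~ ., data = train_svm_p,
    method = "svmLinear",
    trControl = trainControl("cv", number = 5, classProbs = TRUE),
    preProcess = c("center", "scale"))
head(predict(caret.svm.p, newdata = train_svm_p, type = "prob"))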
modelLookup('svmRadial')
## model parameter label forReg forClass probModel
## 1 svmRadial sigma Sigma TRUE TRUE TRUE
## 2 svmRadial C Cost TRUE TRUE TRUE
set.seed(6666)
# Fit the model. Note: train_svm still carries the y_pred_probs2 column
# produced by the linear model, so this fit sees 12 predictors (see the
# printout below); strictly, that column should be dropped first.
caret.svm.radial <- train(as.factor(nota_vino) ~., data = train_svm,
method = "svmRadial",
trControl = trainControl("cv", number = 5, search = "grid"),
preProcess = c("center","scale"),
tuneLength = 20)
#View the model
caret.svm.radial
## Support Vector Machines with Radial Basis Function Kernel
##
## 1279 samples
## 12 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (12), scaled (12)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1023, 1024, 1023, 1023, 1023
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.7545129 0.5096100
## 0.50 0.7568566 0.5142522
## 1.00 0.7646599 0.5304827
## 2.00 0.7669945 0.5355742
## 4.00 0.7615257 0.5244360
## 8.00 0.7544884 0.5099085
## 16.00 0.7584130 0.5171537
## 32.00 0.7482322 0.4960094
## 64.00 0.7325888 0.4639937
## 128.00 0.7247580 0.4483362
## 256.00 0.7083180 0.4161348
## 512.00 0.7185049 0.4355438
## 1024.00 0.7200827 0.4384403
## 2048.00 0.7177328 0.4335603
## 4096.00 0.7169547 0.4315018
## 8192.00 0.7114828 0.4206549
## 16384.00 0.7099203 0.4175208
## 32768.00 0.7099203 0.4175208
## 65536.00 0.7099203 0.4175208
## 131072.00 0.7099203 0.4175208
##
## Tuning parameter 'sigma' was held constant at a value of 0.07591253
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.07591253 and C = 2.
plot(caret.svm.radial)
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
get_best_result(caret.svm.radial)
## sigma C Accuracy Kappa AccuracySD KappaSD
## 1 0.07591253 2 0.7669945 0.5355742 0.01186896 0.02335423
train_svm$y_pred_probs2 <- predict(caret.svm.radial, newdata = train_svm)
# train_svm$y_pred_probs2 <-
# ifelse(train_svm$y_pred_probs2$`1` > 0.5,
# train_svm$y_pred_probs2$`1`,
# 1-train_svm$y_pred_probs2$`0`)
# train_svm$y_pred2 <- ifelse(train_svm$y_pred_probs2 >
# 0.5, 1, 0)
train_svm$y_pred_probs2
## [1] 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1 0 0 1
## [38] 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 0 1 1 0 0 0
## [75] 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 0 1 0 0 0 0 1 0 0 1 1
## [112] 1 1 0 0 0 1 0 0 0 0 1 0 0 1 1 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1 1 1 0 1 1 0 0
## [149] 1 1 0 0 1 1 0 0 1 1 1 1 1 0 1 0 1 0 0 1 0 0 0 1 1 1 0 1 0 0 1 1 1 0 0 1 1
## [186] 1 1 0 1 0 0 1 0 1 0 0 1 1 1 1 0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 1 1 1 0 1 1 1
## [223] 0 0 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1 0 0 1 0 1 1 0 0 1 1 0 1 1 1 1 0 1 0 1 1
## [260] 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 0 1 1 0 0 1 0 1
## [297] 1 1 0 0 1 0 0 1 1 1 1 1 0 1 0 1 0 0 1 1 0 0 1 0 0 1 0 1 1 0 0 0 0 1 1 0 0
## [334] 0 0 0 1 1 1 0 0 0 1 0 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 1 1 1 0 0 0 1 0 1 0
## [371] 0 0 1 0 1 0 0 1 1 0 0 0 1 1 1 1 1 0 0 1 1 1 0 1 0 0 0 1 0 0 1 1 1 1 1 0 0
## [408] 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 1 1 1 1 1 0 0 1 0 0 1 0 1 1
## [445] 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0
## [482] 0 0 1 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 1
## [519] 0 1 0 1 0 0 0 1 1 0 1 0 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 0 0 0 1 0 1
## [556] 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 1 1 0 0 1
## [593] 1 1 0 1 0 0 0 0 1 1 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1
## [630] 1 1 0 1 1 1 1 0 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 1 0 0 1 0 1 0 0 1 1 1 1 0 1
## [667] 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 1 0 0 1 1 0 1 0 1 0 1 1 0 0 0 0
## [704] 1 0 0 1 0 0 1 0 0 1 1 1 1 1 0 1 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [741] 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 1 1 1 1 0 1 0 0 1 0 1 1 0 0 1 1
## [778] 1 0 0 1 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 1 1 0 1 0 0 1 0 0
## [815] 0 0 1 1 0 0 0 0 0 0 1 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1 1 0 1 1
## [852] 0 0 1 1 0 1 0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 0
## [889] 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1
## [926] 1 1 1 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 1
## [963] 0 0 1 1 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 0 1 1 0
## [1000] 1 0 1 0 0 0 0 1 0 1 1 0 1 0 1 1 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1
## [1037] 1 1 0 0 0 0 1 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 1 1 1
## [1074] 0 0 1 0 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 1 1 1 0 1 0 1 0 1 1 1
## [1111] 1 1 0 1 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1148] 0 1 1 0 0 0 1 0 1 1 0 1 0 0 1 0 1 0 1 1 0 1 0 0 0 0 0 1 1 0 0 1 1 0 1 0 0
## [1185] 0 0 0 1 1 1 0 1 0 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 1 0 1 0 0 1 0 0 0 1 1 1
## [1222] 0 0 0 1 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 0 1 1 1 1
## [1259] 1 1 0 1 1 0 0 0 1 1 1 1 1 1 0 1 0 0 0 0 0
## Levels: 0 1
# train_svm train_svm$y_pred2
We reproduce the confusion matrix and the evaluation metrics for the final radial SVM model:
cm_train_svm_radial <- confusionMatrix(as.factor(train_svm$y_pred_probs2),
as.factor(train_svm$nota_vino), positive = "1")
cm_train_svm_radial$table
## Reference
## Prediction 0 1
## 0 505 158
## 1 92 524
# result
cm_train_svm_radial$overall["Accuracy"] %>%
round(2)
## Accuracy
## 0.8
cm_train_svm_radial$byClass["Recall"] %>%
round(2)
## Recall
## 0.77
cm_train_svm_radial$byClass["Precision"] %>%
round(2)
## Precision
## 0.85
We reproduce the ROC curve for the final radial SVM model:
roc_svm_radial <- plot.roc(as.numeric(train_svm$nota_vino), as.numeric(train_svm$y_pred_probs2))
auc(roc_svm_radial)
## Area under the curve: 0.8071
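Because of the leftover prediction column noted above, this radial model cannot be scored directly on the test set; a minimal sketch of a clean refit and test evaluation (train_svm_clean, svm_radial_clean and test_svm are illustrative names; tuneLength reduced to keep it quick; output not shown) would be:
train_svm_clean <- train_svm[, setdiff(names(train_svm), "y_pred_probs2")]
set.seed(6666)
svm_radial_clean <- train(as.factor(nota_vino) ~ ., data = train_svm_clean,
    method = "svmRadial",
    trControl = trainControl("cv", number = 5),
    preProcess = c("center", "scale"), tuneLength = 5)
test_svm <- test %>%
    mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
    mutate(quality = NULL)
confusionMatrix(predict(svm_radial_clean, newdata = test_svm),
    as.factor(test_svm$nota_vino), positive = "1")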
We build a linear SVM model with all the predictors in our data set.
#set.seed(13)
#modelo_svmlineal <- train(nota_vino ~ ., method = "svmLinear", data = train_svm)
#modelo_svmlineal$finalModel
With the finalModel object we can see the hyperparameters used (cost C) and the training error. This error is the one the model makes when predicting the very observations it was trained on. In our case it is 24.44%.
We fit and evaluate our model many times on different subsets of the training data via cross-validation, obtaining an error estimate for each repetition. With these methods the computational cost of fitting a model many times is high, which is why, with caret, we can parallelize the process to speed it up.
# parallelization
#registerDoMC(cores = 4)
# number of partitions and repetitions for cross-validation
#particiones <- 10
#repeticiones <- 5
# model
#control_modelosvm_lineal <- trainControl(method = "repeatedcv", number = particiones,repeats = repeticiones,returnResamp = "all", verboseIter = FALSE,allowParallel = TRUE)
#set.seed(342)
#modelo_svmlineal <- train(nota_vino ~ ., data = train_svm,
# method = "svmLinear",
# metric = "Accuracy",
# trControl = control_modelosvm_lineal)
#modelo_svmlineal
With this cross-validation of 10 partitions and 5 repetitions we have fitted our model 50 times. We can plot the accuracy obtained in each of these fits.
#grafo1 <- ggplot(data = modelo_svmlineal$resample, aes(x = Accuracy)) +
#    geom_density(alpha = 0.5, fill = "blue") +
#    geom_vline(xintercept = mean(modelo_svmlineal$resample$Accuracy), linetype = "dashed") +
#    theme_bw()
#grafo2 <- ggplot(data = modelo_svmlineal$resample, aes(x = 1, y = Accuracy)) +
#    geom_boxplot(outlier.shape = NA, alpha = 0.5, fill = "blue") +
#    geom_jitter(width = 0.05) + labs(x = "") + theme_bw() +
#    theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
#final_plot_svm <- ggarrange(grafo1, grafo2)
#final_plot_svm <- annotate_figure(final_plot_svm, top = text_grob("Accuracy obtained in validation", size = 15))
#final_plot_svm
With cross-validation we reach an average accuracy of 0.7513. In other words, our linear SVM model correctly predicts whether a wine is good or bad about 75% of the time.
As we have seen, our svmLinear model has a hyperparameter called cost (C). We refit the model with different values of C and apply cross-validation again to identify which of the sub-models gives the best result.
For the different values of C we chose to work with a grid search, where the exact hyperparameter values are specified.
# parallelization
#registerDoMC(cores = 4)
# hyperparameters and number of repetitions
#particiones <- 10
#repeticiones <- 5
#hiperparametros <- data.frame(C = c(0.001, 0.01, 0.1, 0.5, 1, 10))
# model
#control_modelosvm_lineal <- trainControl(method = "repeatedcv", number = particiones,repeats = repeticiones,returnResamp = "all", verboseIter =FALSE,allowParallel = TRUE)
#set.seed(342)
#modelo_svmlineal <- train(nota_vino ~ ., data = train_svm,method = "svmLinear",tuneGrid = hiperparametros,metric = "Accuracy",trControl =control_modelosvm_lineal)
#modelo_svmlineal
We can see that, of all the values of C tried, C = 0.01 returns the best results, with an accuracy of 0.7526.
We can observe the variation in accuracy for each value of C in the plot below.
#ggplot(data = modelo_svmlineal$resample,aes(x = as.factor(C), y = Accuracy, color = as.factor(C))) +geom_boxplot(outlier.shape = NA, alpha = 0.6) +
# geom_jitter(width = 0.2, alpha = 0.6) + geom_hline(yintercept = 0.62, linetype = "dashed") +labs(x = "C") + theme_bw() + theme(legend.position = "none")
We can also observe how the models evolve over the hyperparameter values we chose.
#ggplot(modelo_svmlineal, highlight = TRUE) +labs(title = "Evolución accuracy en función de C") + theme_bw()
train_km <- train %>%
mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
mutate(quality = NULL)
train_km
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_km$nota_vino)
##
## 0 1
## 597 682
str(train_km)
## tibble [1,279 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
## $ volatile_acidity : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
## $ citric_acid : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
## $ density : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
## $ alcohol : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
## $ pH : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
## $ Log_residual_sugar : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
## $ Log_chlorides : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
## $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
## $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
## $ Log_sulphates : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
## $ nota_vino : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
We remove the response variable "quality":
train_kmeans <- train[, -6]
train_kmeans
## # A tibble: 1,279 × 11
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 5 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>
train_kmeans_s <- scale(train_kmeans, center = TRUE, scale = TRUE)
#train_kmeans_s
We look for the optimal number of clusters for our model:
fviz_nbclust(train_kmeans_s, kmeans, method = "wss")
Another way to search for the optimum is Hartigan's rule, which suggests adding a cluster while the Hartigan statistic exceeds 10:
wholesaleBest = FitKMeans(train_kmeans_s, max.clusters = 10, nstart = 25, seed = 666)
wholesaleBest
## Clusters Hartigan AddCluster
## 1 2 294.35525 TRUE
## 2 3 162.77234 TRUE
## 3 4 136.64204 TRUE
## 4 5 100.50144 TRUE
## 5 6 89.96797 TRUE
## 6 7 73.44179 TRUE
## 7 8 60.25836 TRUE
## 8 9 54.44277 TRUE
## 9 10 41.22175 TRUE
PlotHartigan(wholesaleBest)
#calculate gap statistic based on number of clusters
gap_stat <- clusGap(train_kmeans_s,
FUN = kmeans,
nstart = 25,
K.max = 10,
B = 50)
#plot number of clusters vs. gap statistic
fviz_gap_stat(gap_stat)
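For reference, the gap statistic (Tibshirani et al., 2001) compares the observed within-cluster dispersion W_k with its expectation under a null reference distribution estimated from the B bootstrap samples:
Gap(k) = E*[log(W_k)] - log(W_k)
The suggested number of clusters is the smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}, where s_{k+1} accounts for the simulation error.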
We run K-means clustering with the chosen number of clusters (K = 3):
#make this example reproducible
set.seed(666)
km <- kmeans(train_kmeans_s, centers = 3, nstart = 25)
#view results
km
## K-means clustering with 3 clusters of sizes 431, 372, 476
##
## Cluster means:
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## 1 -0.1147168 0.07994522 -0.04724418 0.2777948 -0.5091775 -0.06382814
## 2 1.0549088 -0.66861546 1.05709021 0.4879936 0.2866082 -0.75490871
## 3 -0.7205528 0.45014403 -0.78335151 -0.6329059 0.2370531 0.64776464
## Log_residual_sugar Log_chlorides Log_free_sulfur_dioxide
## 1 0.30242179 0.08195376 0.7733582
## 2 0.07797989 0.31218734 -0.6125148
## 3 -0.33477376 -0.31818437 -0.2215585
## Log_total_sulfur_dioxide Log_sulphates
## 1 0.9572131 -0.1234124
## 2 -0.5725535 0.5785520
## 3 -0.4192625 -0.3404004
##
## Clustering vector:
## [1] 3 1 3 1 1 3 2 1 3 3 1 2 1 1 2 1 2 1 3 2 2 2 2 1 3 2 1 2 3 2 3 1 3 2 3 1 3
## [38] 1 2 2 2 1 3 2 1 3 2 1 3 1 3 1 2 3 3 2 3 1 2 2 2 3 1 2 2 3 3 2 1 3 1 2 1 1
## [75] 1 1 1 3 2 1 2 3 2 1 1 1 2 1 3 3 1 3 3 3 1 3 3 2 2 2 1 2 3 3 3 3 1 3 1 2 2
## [112] 1 2 1 3 1 2 1 1 1 1 1 1 3 3 2 1 3 2 2 2 1 2 2 1 3 3 3 3 2 1 3 1 2 2 3 3 3
## [149] 2 2 3 3 2 3 3 3 2 2 3 2 2 3 2 3 3 1 3 2 1 1 1 2 1 3 3 2 1 3 2 3 3 1 3 3 2
## [186] 2 2 1 2 3 1 2 3 3 1 3 3 2 3 2 1 2 1 3 2 2 1 2 3 1 1 3 2 2 1 2 2 2 1 2 2 1
## [223] 1 1 2 3 1 3 1 1 2 3 1 1 3 1 1 2 2 1 3 3 3 3 3 1 1 2 2 1 3 2 2 2 3 3 3 3 2
## [260] 1 3 3 3 1 3 3 3 1 3 2 1 3 3 3 2 2 2 3 2 3 1 2 3 2 2 3 3 3 3 3 3 2 3 2 3 3
## [297] 2 3 1 1 2 3 3 1 2 2 2 1 1 3 3 3 1 3 2 2 1 3 3 1 3 1 2 3 3 2 2 1 2 2 2 3 1
## [334] 1 3 2 2 2 2 1 1 3 2 1 2 3 1 3 2 3 1 1 1 3 2 1 1 1 3 1 1 2 2 3 2 1 2 1 2 1
## [371] 3 1 2 3 3 1 3 3 3 1 3 2 2 1 2 2 1 1 2 2 2 2 3 3 1 3 1 2 3 3 1 3 2 3 2 1 3
## [408] 3 3 1 3 1 1 1 2 2 3 3 3 3 1 2 2 1 3 3 2 2 1 3 2 2 1 2 3 2 1 3 2 1 3 2 2 3
## [445] 3 2 2 3 3 2 3 2 1 3 1 2 3 1 1 1 2 1 1 1 2 2 2 2 1 3 3 1 2 3 3 1 3 3 1 3 2
## [482] 1 1 2 1 3 2 3 2 3 3 2 3 1 2 1 3 2 1 2 3 3 2 3 2 2 3 1 1 3 2 3 1 2 2 2 1 2
## [519] 3 3 1 2 1 3 3 3 3 1 2 3 3 3 1 1 2 1 2 3 3 3 1 3 2 3 3 3 3 3 3 3 3 2 3 2 3
## [556] 1 1 2 3 3 3 3 1 3 3 2 2 3 1 2 2 1 1 3 3 3 3 1 3 2 2 1 2 1 1 1 3 1 1 3 3 2
## [593] 3 2 3 2 1 1 1 1 2 1 2 3 3 3 3 2 2 3 2 3 3 2 1 1 2 3 3 3 2 3 1 3 2 1 3 3 2
## [630] 3 2 1 2 2 2 3 1 2 1 2 1 3 3 2 1 3 2 1 3 3 1 3 1 1 2 2 1 2 1 3 2 3 2 1 1 2
## [667] 3 1 3 1 3 3 3 3 3 1 3 3 2 2 3 3 2 1 3 3 3 3 1 1 2 3 3 2 3 1 1 3 3 1 1 1 1
## [704] 2 1 3 2 3 1 2 1 3 3 3 2 3 2 3 2 3 2 2 1 2 2 1 1 1 1 1 3 1 3 3 1 1 1 3 2 1
## [741] 2 1 1 1 2 3 1 3 1 2 3 2 1 1 2 1 3 3 2 3 1 3 3 1 3 2 1 1 1 2 1 2 2 3 1 2 3
## [778] 2 1 2 3 2 1 3 3 2 3 1 3 2 1 3 1 2 3 3 1 2 2 1 2 3 3 3 1 3 2 1 3 2 3 2 3 1
## [815] 3 3 1 3 1 2 1 2 3 3 3 3 3 1 2 1 3 2 1 3 1 1 2 1 3 1 3 3 1 2 1 1 2 3 1 3 3
## [852] 1 1 2 2 2 3 1 3 2 1 2 1 1 3 1 3 3 1 1 3 2 3 1 2 1 1 3 3 2 1 3 1 1 3 3 2 1
## [889] 1 3 3 1 3 2 2 1 3 2 1 3 1 1 1 3 2 3 3 1 2 1 2 3 1 1 1 1 1 2 2 3 2 1 3 1 1
## [926] 2 2 2 3 3 1 2 1 1 3 1 1 2 2 3 1 3 1 3 3 1 1 1 2 2 3 1 3 1 2 1 3 2 3 3 3 2
## [963] 1 1 1 3 3 2 2 1 2 1 3 2 1 2 1 1 1 2 1 1 2 1 3 3 3 3 3 2 3 2 3 1 1 3 2 3 3
## [1000] 2 3 1 1 1 1 3 2 1 2 2 3 1 1 2 1 1 3 1 2 1 3 3 3 2 1 1 1 3 2 3 3 3 1 1 1 2
## [1037] 2 1 1 1 3 3 2 1 3 1 2 1 1 2 3 3 1 2 3 1 3 1 2 1 3 1 3 3 2 3 2 1 2 3 3 1 1
## [1074] 3 1 2 3 3 1 3 2 1 3 2 1 1 1 2 2 3 3 3 3 3 1 3 2 1 1 1 1 2 3 3 3 3 1 2 3 2
## [1111] 3 2 1 2 1 1 2 3 3 2 2 2 3 3 3 3 1 1 2 2 1 2 1 1 3 1 1 3 3 3 1 1 2 1 1 3 1
## [1148] 1 3 2 3 1 3 3 3 2 1 1 3 1 3 2 1 3 1 3 2 1 1 2 1 3 3 1 3 2 3 1 2 2 1 3 3 3
## [1185] 3 1 3 1 3 2 3 2 2 3 3 2 1 3 1 1 3 3 1 3 2 2 2 2 1 3 3 1 1 1 3 1 1 1 2 2 3
## [1222] 3 1 3 1 3 3 2 2 1 2 1 1 3 2 2 1 3 1 3 3 3 3 3 3 1 3 3 3 3 1 2 1 1 2 2 3 2
## [1259] 2 2 1 1 2 3 1 1 3 2 1 3 1 2 2 1 1 3 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 2830.850 3715.966 3585.265
## (between_SS / total_SS = 27.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
We plot the clusters obtained (with more than two variables, fviz_cluster projects the observations onto the first two principal components):
fviz_cluster(km, data = train_kmeans_s, geom = "point")
We look at the mean of each (scaled) variable within each cluster, and at the cluster sizes:
aggregate(train_kmeans_s, by=list(cluster=km$cluster), mean)
## cluster fixed_acidity volatile_acidity citric_acid density alcohol
## 1 1 -0.1147168 0.07994522 -0.04724418 0.2777948 -0.5091775
## 2 2 1.0549088 -0.66861546 1.05709021 0.4879936 0.2866082
## 3 3 -0.7205528 0.45014403 -0.78335151 -0.6329059 0.2370531
## pH Log_residual_sugar Log_chlorides Log_free_sulfur_dioxide
## 1 -0.06382814 0.30242179 0.08195376 0.7733582
## 2 -0.75490871 0.07797989 0.31218734 -0.6125148
## 3 0.64776464 -0.33477376 -0.31818437 -0.2215585
## Log_total_sulfur_dioxide Log_sulphates
## 1 0.9572131 -0.1234124
## 2 -0.5725535 0.5785520
## 3 -0.4192625 -0.3404004
km$size
## [1] 431 372 476
We cross-tabulate the clusters against “quality” and against “nota_vino”:
table(km$cluster, train$quality)
##
## 3 4 5 6 7 8
## 1 0 8 262 139 22 0
## 2 2 5 98 171 88 8
## 3 5 25 192 203 46 5
table(km$cluster, train_km$nota_vino)
##
## 0 1
## 1 270 161
## 2 105 267
## 3 222 254
kmean_model_cluster <- as.data.frame(km$cluster)
names(kmean_model_cluster)[1] <- 'cluster'
# Arbitrary mapping: wines in cluster 1 are labeled "1" (good), the rest "0" (bad)
kmean_model_cluster$nota_vino <- as.factor(ifelse(kmean_model_cluster$cluster == 1, '1', '0'))
kmean_model_cluster$nota_vino
## [1] 0 1 0 1 1 0 0 1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0
## [38] 1 0 0 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 1 1
## [75] 1 1 1 0 0 1 0 0 0 1 1 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0
## [112] 1 0 1 0 1 0 1 1 1 1 1 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0
## [186] 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1
## [223] 1 1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0
## [260] 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 1 1 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1
## [334] 1 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 1 1 1 0 0 1 1 1 0 1 1 0 0 0 0 1 0 1 0 1
## [371] 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0
## [408] 0 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 0 1 1 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0
## [482] 1 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0
## [519] 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [556] 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 1 0 1 1 1 0 1 1 0 0 0
## [593] 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0
## [630] 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 1 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0 1 1 0
## [667] 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 1 1 1 1
## [704] 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1 0 1 0 0 1 1 1 0 0 1
## [741] 0 1 1 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 0 1 0 0
## [778] 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1
## [815] 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 1 0 0 1 0 1 1 0 0 1 0 0
## [852] 1 1 0 0 0 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 0 1 0 1 1 0 0 0 1
## [889] 1 0 0 1 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 1 0 1 0 0 1 1 1 1 1 0 0 0 0 1 0 1 1
## [926] 0 0 0 0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0
## [963] 1 1 1 0 0 0 0 1 0 1 0 0 1 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0
## [1000] 0 0 1 1 1 1 0 0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 0
## [1037] 0 1 1 1 0 0 0 1 0 1 0 1 1 0 0 0 1 0 0 1 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 1 1
## [1074] 0 1 0 0 0 1 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 0 1 0 0 0
## [1111] 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 1
## [1148] 1 0 0 0 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0
## [1185] 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 1 1 1 0 0 0
## [1222] 0 1 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0 0
## [1259] 0 0 1 1 0 0 1 1 0 0 1 0 1 0 0 1 1 0 1 1 1
## Levels: 0 1
#train_km$nota_vino
#kmean_model_cluster$nota_vino
str(train_km$nota_vino)
## num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
str(kmean_model_cluster$nota_vino)
## Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 2 1 1 ...
confusionMatrix(as.factor(train_km$nota_vino), kmean_model_cluster$nota_vino)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 327 270
## 1 521 161
##
## Accuracy : 0.3815
## 95% CI : (0.3548, 0.4088)
## No Information Rate : 0.663
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.2107
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.3856
## Specificity : 0.3735
## Pos Pred Value : 0.5477
## Neg Pred Value : 0.2361
## Prevalence : 0.6630
## Detection Rate : 0.2557
## Detection Prevalence : 0.4668
## Balanced Accuracy : 0.3796
##
## 'Positive' Class : 0
##
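The agreement of 0.3815 is below chance largely because K-means cluster labels are arbitrary: simply swapping the 0/1 mapping would give 1 - 0.3815 ≈ 0.62. A fairer comparison assigns each cluster its majority class first; a minimal sketch, assuming km and train_km from above:
# Map each cluster to the majority nota_vino value observed within it
mapping <- apply(table(km$cluster, train_km$nota_vino), 1, which.max) - 1
pred_cluster <- factor(mapping[km$cluster], levels = c(0, 1))
confusionMatrix(pred_cluster, as.factor(train_km$nota_vino))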
Even with the best mapping, the clusters separate good and bad wines only weakly. What if we use a continuous response variable? We fit a GAM with “quality” as a continuous response:
gam_mod <- gam(quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
                 s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) + s(citric_acid) +
                 s(fixed_acidity), data = train, method = "REML")
gam_mod
##
## Family: gaussian
## Link function: identity
##
## Formula:
## quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) +
## s(citric_acid) + s(fixed_acidity)
##
## Estimated degrees of freedom:
## 3.80 1.00 3.69 1.00 2.25 2.94 1.99
## 2.55 total = 20.23
##
## REML score: 1221.011
coef(gam_mod)
## (Intercept) s(alcohol).1
## 5.634871e+00 1.648020e-01
## s(alcohol).2 s(alcohol).3
## -1.165238e-01 7.041699e-02
## s(alcohol).4 s(alcohol).5
## -9.552495e-02 -4.469214e-02
## s(alcohol).6 s(alcohol).7
## 1.131493e-01 -4.810090e-02
## s(alcohol).8 s(alcohol).9
## 6.033374e-01 8.274343e-02
## s(volatile_acidity).1 s(volatile_acidity).2
## -4.570556e-05 1.038005e-06
## s(volatile_acidity).3 s(volatile_acidity).4
## -1.351067e-05 -1.793403e-05
## s(volatile_acidity).5 s(volatile_acidity).6
## 1.458357e-05 1.751913e-05
## s(volatile_acidity).7 s(volatile_acidity).8
## -1.271810e-05 1.597256e-04
## s(volatile_acidity).9 s(Log_sulphates).1
## -1.701127e-01 -7.829302e-03
## s(Log_sulphates).2 s(Log_sulphates).3
## -7.099365e-02 4.537419e-02
## s(Log_sulphates).4 s(Log_sulphates).5
## -3.268186e-02 -4.713404e-02
## s(Log_sulphates).6 s(Log_sulphates).7
## 1.907383e-02 1.687056e-02
## s(Log_sulphates).8 s(Log_sulphates).9
## 1.915352e-01 1.288852e-01
## s(Log_chlorides).1 s(Log_chlorides).2
## 8.907392e-06 -3.929104e-07
## s(Log_chlorides).3 s(Log_chlorides).4
## -4.775713e-06 -2.441926e-06
## s(Log_chlorides).5 s(Log_chlorides).6
## 2.867426e-06 5.221761e-07
## s(Log_chlorides).7 s(Log_chlorides).8
## -2.957613e-06 -1.322006e-05
## s(Log_chlorides).9 s(pH).1
## -6.951358e-02 -9.276906e-03
## s(pH).2 s(pH).3
## 1.043393e-02 -9.687645e-03
## s(pH).4 s(pH).5
## -1.755619e-02 2.316811e-03
## s(pH).6 s(pH).7
## 1.786883e-02 9.528146e-03
## s(pH).8 s(pH).9
## 1.380056e-01 -9.893089e-02
## s(Log_total_sulfur_dioxide).1 s(Log_total_sulfur_dioxide).2
## 1.264371e-01 -7.861424e-03
## s(Log_total_sulfur_dioxide).3 s(Log_total_sulfur_dioxide).4
## 3.959153e-02 -2.883118e-02
## s(Log_total_sulfur_dioxide).5 s(Log_total_sulfur_dioxide).6
## -3.644112e-02 3.414453e-02
## s(Log_total_sulfur_dioxide).7 s(Log_total_sulfur_dioxide).8
## 2.681938e-02 1.658594e-01
## s(Log_total_sulfur_dioxide).9 s(citric_acid).1
## 4.615124e-02 -8.294107e-03
## s(citric_acid).2 s(citric_acid).3
## 2.156461e-02 -8.422364e-03
## s(citric_acid).4 s(citric_acid).5
## 1.810441e-02 7.342591e-03
## s(citric_acid).6 s(citric_acid).7
## -2.160911e-02 7.256404e-03
## s(citric_acid).8 s(citric_acid).9
## -1.070232e-01 -3.440155e-02
## s(fixed_acidity).1 s(fixed_acidity).2
## -3.380473e-03 -4.880922e-02
## s(fixed_acidity).3 s(fixed_acidity).4
## -2.163905e-02 -4.395839e-02
## s(fixed_acidity).5 s(fixed_acidity).6
## 1.957378e-02 -4.981481e-02
## s(fixed_acidity).7 s(fixed_acidity).8
## -2.147517e-02 2.926902e-01
## s(fixed_acidity).9
## -3.132917e-02
summary(gam_mod)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) +
## s(citric_acid) + s(fixed_acidity)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.63487 0.01713 328.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(alcohol) 3.799 4.762 54.064 < 2e-16 ***
## s(volatile_acidity) 1.003 1.006 54.789 < 2e-16 ***
## s(Log_sulphates) 3.691 4.634 23.398 < 2e-16 ***
## s(Log_chlorides) 1.000 1.001 11.922 0.000572 ***
## s(pH) 2.252 2.906 5.016 0.001882 **
## s(Log_total_sulfur_dioxide) 2.943 3.724 3.550 0.008809 **
## s(citric_acid) 1.991 2.508 2.440 0.062463 .
## s(fixed_acidity) 2.553 3.260 1.909 0.134935
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.404 Deviance explained = 41.3%
## -REML = 1221 Scale est. = 0.37532 n = 1279
We refit the model keeping only the three smooth terms with the strongest significance (p < 2e-16):
gam_mod2 <- gam(quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates),
                data = train, method = "REML")
summary(gam_mod2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.63487 0.01749 322.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(alcohol) 3.958 4.951 60.74 <2e-16 ***
## s(volatile_acidity) 1.001 1.002 88.52 <2e-16 ***
## s(Log_sulphates) 4.191 5.202 19.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.379 Deviance explained = 38.3%
## -REML = 1231.4 Scale est. = 0.39111 n = 1279
We plot the estimated smooths with partial residuals and run gam.check() to validate the model:
plot(gam_mod2, residuals = TRUE, pch = 1, shade = TRUE, shade.col = "lightblue")
gam.check(gam_mod2)
##
## Method: REML Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [-2.529408e-05,1.537895e-06]
## (score 1231.431 & scale 0.3911066).
## Hessian positive definite, eigenvalue range [2.519977e-05,637.5075].
## Model rank = 28 / 28
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(alcohol) 9.00 3.96 1.03 0.830
## s(volatile_acidity) 9.00 1.00 0.96 0.065 .
## s(Log_sulphates) 9.00 4.19 0.97 0.140
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All k-index values are close to 1 and the estimated edf are well below k', so the chosen basis dimensions appear adequate.
Reference: https://noamross.github.io/gams-in-r-course/chapter4
First of all, we create the binary variable “nota_vino” so that, based on “quality”, it flags the wines as passing (quality >= 6, coded “1”) or failing (quality < 6, coded “0”); this is the same recoding used for train_km above:
train_gam <- train %>%
  mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
  mutate(quality = NULL)
train_gam
## # A tibble: 1,279 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.1 0.48 0.28 0.997 10.3 3.24
## 2 7.6 0.49 0.33 0.997 9 3.3
## 3 5 1.02 0.04 0.994 10.5 3.75
## 4 7.6 0.43 0.29 0.997 9.5 3.4
## 5 6.8 0.59 0.1 0.996 9.7 3.3
## 6 6.8 0.815 0 0.995 9.8 3.3
## 7 8.5 0.21 0.52 0.996 10.4 3.36
## 8 7.4 0.36 0.29 0.996 11 3.3
## 9 5.5 0.49 0.03 0.991 14 3.3
## 10 6.8 0.49 0.22 0.994 11.3 3.41
## # … with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_gam$nota_vino)
##
## 0 1
## 597 682
To apply a logistic GAM to our problem, we use the mgcv package with family = binomial, which tells the gam function that our response variable takes the value 0 or 1, i.e., bad wine or good wine. Each predictor is wrapped in the function s(), which specifies that we want the relationship to be modeled flexibly, as a smooth term.
gam_mod_log <- gam(nota_vino ~ s(alcohol) + s(volatile_acidity) +
                     s(Log_sulphates) + s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) +
                     s(citric_acid) + s(fixed_acidity), data = train_gam, method = "REML",
                   family = binomial)
summary(gam_mod_log)
##
## Family: binomial
## Link function: logit
##
## Formula:
## nota_vino ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) +
## s(citric_acid) + s(fixed_acidity)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.30934 0.07799 3.966 7.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(alcohol) 5.846 6.974 143.805 < 2e-16 ***
## s(volatile_acidity) 1.855 2.377 27.641 3.25e-06 ***
## s(Log_sulphates) 2.484 3.169 61.194 < 2e-16 ***
## s(Log_chlorides) 6.015 7.183 18.378 0.0121 *
## s(pH) 2.416 3.119 6.121 0.1115
## s(Log_total_sulfur_dioxide) 3.741 4.706 28.623 2.87e-05 ***
## s(citric_acid) 1.490 1.828 6.633 0.0191 *
## s(fixed_acidity) 2.535 3.226 6.556 0.1200
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.371 Deviance explained = 32.2%
## -REML = 642.13 Scale est. = 1 n = 1279
We keep only the most significant variables (those with three asterisks, ***):
gam_mod_log2 <- gam(nota_vino ~ s(alcohol) + s(volatile_acidity) +
                      s(Log_sulphates) + s(Log_total_sulfur_dioxide), data = train_gam,
                    method = "REML", family = binomial)
summary(gam_mod_log2)
##
## Family: binomial
## Link function: logit
##
## Formula:
## nota_vino ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_total_sulfur_dioxide)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26735 0.07471 3.579 0.000345 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(alcohol) 5.927 7.058 150.80 < 2e-16 ***
## s(volatile_acidity) 1.049 1.095 34.16 < 2e-16 ***
## s(Log_sulphates) 4.078 5.044 54.78 < 2e-16 ***
## s(Log_total_sulfur_dioxide) 3.557 4.487 35.85 5.63e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.346 Deviance explained = 29.4%
## -REML = 649.89 Scale est. = 1 n = 1279
Our intercept is 0.26735; we use the plogis() function to transform it into a probability:
plogis(coef(gam_mod_log2)[1])
## (Intercept)
## 0.566441
This value means that our model predicts a baseline probability of approximately 57% that a wine is good.
We plot the partial effect of each smooth term on the probability scale (shift adds the intercept and trans = plogis converts the log-odds):
plot(gam_mod_log2, residuals = TRUE, pch = 1, shade = TRUE, shade.col = "lightblue",
     trans = plogis, shift = coef(gam_mod_log2)[1], seWithMean = TRUE, col = "purple")
#predict(gam_mod_log2, type="response", se.fit = TRUE)
#plogis(predict(gam_mod_log2, type="link"))
We now check the behavior on the test partition. First we create “nota_vino” in the test set:
test_gam <- test %>%
  mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
  mutate(quality = NULL)
test_gam
## # A tibble: 320 × 12
## fixed_acidity volatile_acidity citric_acid density alcohol pH
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 0.998 9.4 3.51
## 2 7.3 0.65 0 0.995 10 3.39
## 3 8.9 0.22 0.48 0.997 9.4 3.39
## 4 7.6 0.41 0.24 0.996 9.5 3.28
## 5 7.1 0.71 0 0.997 9.4 3.47
## 6 5.7 1.13 0.09 0.994 9.8 3.5
## 7 7.3 0.45 0.36 0.998 10.5 3.33
## 8 8.1 0.66 0.22 0.997 10.3 3.3
## 9 6.8 0.67 0.02 0.996 9.5 3.48
## 10 5.6 0.31 0.37 0.995 9.2 3.32
## # … with 310 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## # Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## # Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(test_gam$nota_vino)
##
## 0 1
## 147 173
Explaining the predictors:
head(predict(gam_mod_log2, type = "terms"))
## s(alcohol) s(volatile_acidity) s(Log_sulphates) s(Log_total_sulfur_dioxide)
## 1 -0.153936041 0.11651423 -0.54774301 0.15593321
## 2 -0.891143170 0.09074145 -0.24318835 -0.49397254
## 3 0.007793508 -1.25421557 -0.01908907 -0.49397254
## 4 -1.063371315 0.24564085 0.09246652 -0.13861017
## 5 -0.910318840 -0.16603197 0.25861731 0.09886004
## 6 -0.792699732 -0.73698784 -0.68971449 0.37393429
predict(gam_mod_log2, type = "terms")[1, ]
## s(alcohol) s(volatile_acidity)
## -0.1539360 0.1165142
## s(Log_sulphates) s(Log_total_sulfur_dioxide)
## -0.5477430 0.1559332
plogis(sum(predict(gam_mod_log2, type = "terms")) + coef(gam_mod_log2)[1])
## (Intercept)
## 0.566441
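Because each smooth term is centered, summing the entire terms matrix is approximately zero, so the line above simply recovers plogis of the intercept. To obtain the predicted probability for a single wine, we sum only that observation's row; for the first row, for example:
# Probability for observation 1: intercept plus its own smooth contributions
plogis(sum(predict(gam_mod_log2, type = "terms")[1, ]) + coef(gam_mod_log2)[1])
# The four terms for row 1 sum to about -0.43, giving a probability of roughly 0.46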
gam_mod_log2_test <- gam(nota_vino ~ s(alcohol) + s(volatile_acidity) +
                           s(Log_sulphates) + s(Log_total_sulfur_dioxide), data = test_gam,
                         method = "REML", family = binomial)
summary(gam_mod_log2_test)
##
## Family: binomial
## Link function: logit
##
## Formula:
## nota_vino ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) +
## s(Log_total_sulfur_dioxide)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2339 0.1332 1.757 0.0789 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(alcohol) 1.000 1.000 25.605 7.28e-07 ***
## s(volatile_acidity) 1.000 1.000 18.279 1.93e-05 ***
## s(Log_sulphates) 2.309 2.948 7.506 0.0627 .
## s(Log_total_sulfur_dioxide) 2.595 3.265 10.146 0.0231 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.257 Deviance explained = 22.3%
## -REML = 180.54 Scale est. = 1 n = 320
plogis(coef(gam_mod_log2_test)[1])
## (Intercept)
## 0.5582201
plot(gam_mod_log2_test, residuals = TRUE, pch = 1, shade = TRUE, shade.col = "lightblue",
     trans = plogis, shift = coef(gam_mod_log2_test)[1], seWithMean = TRUE, col = "purple")
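Note that gam_mod_log2_test above refits the model on the test partition rather than evaluating the model trained on train. A more standard check predicts on test with the already-fitted model; a minimal sketch, assuming gam_mod_log2 and test_gam from above (the 0.5 threshold is our assumption):
# Predicted probabilities on the test set from the model fitted on train
prob_test <- predict(gam_mod_log2, newdata = test_gam, type = "response")
clase_test <- factor(ifelse(prob_test >= 0.5, 1, 0), levels = c(0, 1))
confusionMatrix(clase_test, as.factor(test_gam$nota_vino))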
Multiple Linear Regression: We observe a lack of adequacy and fit in the multiple linear regression model obtained. Its residuals show heteroscedasticity (non-constant variance, violating the homoscedasticity assumption), and the model does not adequately predict the response (dependent) variable from the explanatory (independent) variables. Since our dependent variable, “quality”, is discrete, an ordinary linear regression model does not fit our data well.
Dimensionality Reduction (PCA and t-SNE):
Logistic Regression:
KNN:
Decision Tree:
Ensemble Methods:
Random Forest:
Hyperparameter Tuning:
Clustering (K-means, Hierarchical, and HDBSCAN):
GAM: